1 Executive Summary

During the design process for an aerospace vehicle, decision-makers must have an accurate
understanding of how each choice will affect the vehicle and its performance. This
understanding is based on experiments and, increasingly often, computer models. In general,
as a computer model captures a greater number of phenomena, its results become more
accurate  for  a  broader  range  of  problems.  This  improved  accuracy  typically  comes  at  the  cost 
of  significantly  increased  computational  expense  per  analysis. 

Although rapid analysis tools have been developed that are sufficient for many design efforts,
those  tools  may  not  be  accurate  enough  for  revolutionary  concepts  subject  to  grueling  flight 
conditions  such  as  transonic  or  supersonic  flight  and  extreme  angles  of  attack.  At  such 
conditions,  the  simplifying  assumptions  of  the  rapid  tools  no  longer  hold.  Accurate  analysis  of 
such  concepts  would  require  models  that  do  not  make  those  simplifying  assumptions,  with  the 
corresponding  increases  in  computational  effort  per  analysis.  As  computational  costs  rise, 
exploration  of  the  design  space  can  become  exceedingly  expensive.  If  this  expense  cannot  be 
reduced,  decision-makers  would  be  forced  to  choose  between  a  thorough  exploration  of  the 
design  space  using  inaccurate  models,  or  the  analysis  of  a  sparse  set  of  options  using  accurate 
models.  This  problem  is  exacerbated  as  the  number  of  free  parameters  increases,  limiting  the 
number  of  trades  that  can  be  investigated  in  a  given  time.  In  the  face  of  limited  resources,  it 
can  become  critically  important  that  only  the  most  useful  experiments  be  performed,  which 
raises  multiple  questions:  how  can  the  most  useful  experiments  be  identified,  and  how  can 
experimental  results  be  used  in  the  most  effective  manner? 

This  research  effort  focuses  on  identifying  and  applying  techniques  which  could  address  these 
questions.  The  demonstration  problem  for  this  effort  was  the  modeling  of  a  reusable  booster 
vehicle,  which  would  be  subject  to  a  wide  range  of  flight  conditions  while  returning  to  its 
launch  site  after  staging.  Contour-based  sampling,  an  adaptive  sampling  technique,  seeks  cases 
that  will  improve  the  prediction  accuracy  of  surrogate  models  for  particular  ranges  of  the 
responses  of  interest.  In  the  case  of  the  reusable  booster,  contour-based  sampling  was  used  to 
emphasize  configurations  with  small  pitching  moments;  the  broad  design  space  included 
many  configurations  which  produced  uncontrollable  aerodynamic  moments  for  at  least  one 
flight  condition.  By  emphasizing  designs  that  were  likely  to  trim  over  the  entire  trajectory, 
contour-based  sampling  improves  the  predictive  accuracy  of  surrogate  models  for  such  designs 
while  minimizing  the  number  of  analyses  required. 
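
As a rough illustration of the idea behind contour-based sampling, the following Python sketch repeatedly fits a simple kriging model and adds the candidate point most likely to improve the model near a target response level. It is not the implementation used in this effort: the one-dimensional `response` function, the squared-exponential correlation with a fixed length scale, and the scoring heuristic in `contour_score` are all assumptions chosen for brevity.

```python
import math
import numpy as np

def corr(a, b, length=0.2):
    """Squared-exponential correlation between two sets of 1-D points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def krige(x, y, x_new, jitter=1e-6):
    """Constant-mean kriging: returns predicted mean and standard deviation."""
    K = corr(x, x) + jitter * np.eye(len(x))
    mean = y.mean()
    resid = y - mean
    k_new = corr(x, x_new)                     # shape (n_train, n_new)
    weights = np.linalg.solve(K, k_new)        # K^-1 k for every new point
    mu = mean + weights.T @ resid
    s2 = resid @ np.linalg.solve(K, resid) / len(x)   # process variance estimate
    var = s2 * np.clip(1.0 - np.sum(k_new * weights, axis=0), 0.0, None)
    return mu, np.sqrt(var)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def contour_score(mu, sigma, target=0.0, half_width=0.1):
    """Assumed heuristic: chance the response lies in the target band, times sigma."""
    s = np.maximum(sigma, 1e-12)
    p_band = (norm_cdf((target + half_width - mu) / s)
              - norm_cdf((target - half_width - mu) / s))
    return sigma * p_band

def response(x):
    """Hypothetical stand-in for an expensive pitching-moment analysis."""
    return np.sin(6.0 * x) - 0.3

x = np.array([0.0, 0.35, 0.7, 1.0])            # initial space-filling cases
y = response(x)
candidates = np.linspace(0.0, 1.0, 201)
for _ in range(6):                             # add six adaptive cases
    mu, sigma = krige(x, y, candidates)
    best = candidates[np.argmax(contour_score(mu, sigma))]
    x = np.append(x, best)
    y = np.append(y, response(best))
```

Multiplying the band probability by the predicted standard deviation keeps the score near zero at cases that have already been analyzed, so new cases are drawn toward regions that are both uncertain and likely to lie near the response level of interest.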

The  simplified  models  mentioned  above,  although  less  accurate  for  extreme  flight  conditions, 
can  still  be  useful  for  analyzing  performance  at  more  common  flight  conditions.  The 
simplified  models  may  also  offer  insight  into  trends  in  the  response  behavior.  Data  from 
these  simplified  models  can  be  combined  with  more  accurate  results  to  produce  useful  surrogate 
models  with  better  accuracy  than  the  simplified  models  but  at  less  cost  than  if  only  expensive 
analyses  were  used.  Of  the  data  fusion  techniques  evaluated,  Ghoreyshi  cokriging  was  found  to 
be  the  most  effective  for  the  problem  at  hand. 
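
The general idea of combining cheap and expensive data can be sketched with a much simpler scheme than cokriging: scale the low-fidelity prediction and add a low-order correction fitted to a handful of high-fidelity results. The Python sketch below is not Ghoreyshi cokriging and is not taken from this effort. Both analysis functions are hypothetical stand-ins, and the low-fidelity function is constructed so that a linear correction recovers the high-fidelity response exactly, which a real problem would not permit.

```python
import numpy as np

def high_fidelity(x):
    """Hypothetical stand-in for an expensive analysis."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def low_fidelity(x):
    """Hypothetical cheap analysis: rescaled and biased, but it shares the trend."""
    return 0.5 * high_fidelity(x) + 10.0 * (x - 0.5) - 5.0

def fit_correction(x_hi, y_hi):
    """Least-squares fit of y_hi ~ rho * low_fidelity(x) + a + b * x."""
    A = np.column_stack([low_fidelity(x_hi), np.ones_like(x_hi), x_hi])
    coeffs, *_ = np.linalg.lstsq(A, y_hi, rcond=None)
    return coeffs

def fused_model(x, coeffs):
    """Scaled low-fidelity prediction plus the fitted linear correction."""
    rho, a, b = coeffs
    return rho * low_fidelity(x) + a + b * x

x_hi = np.array([0.0, 0.4, 0.6, 1.0])          # only four expensive analyses
coeffs = fit_correction(x_hi, high_fidelity(x_hi))

x_test = np.linspace(0.0, 1.0, 101)
err_low = np.max(np.abs(low_fidelity(x_test) - high_fidelity(x_test)))
err_fused = np.max(np.abs(fused_model(x_test, coeffs) - high_fidelity(x_test)))
```

In practice the discrepancy between fidelity levels is rarely linear, which is one motivation for modeling it with a second kriging process as cokriging approaches do.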

Lastly, uncertainty present in the data was found to negatively affect predictive accuracy of
surrogate models. Most surrogate modeling techniques neglect uncertainty in the data and
treat all cases as deterministic. This is plausible, especially for data produced by computer
analyses which are assumed to be perfectly repeatable and thus truly deterministic. However, a
number of sources of uncertainty, such as solver iteration or surrogate model prediction accuracy,
can introduce noise to the data. If these sources of uncertainty could be captured and
incorporated when surrogate models are trained, the resulting surrogate models would be less
susceptible to that noise and correspondingly have better predictive accuracy. This was
accomplished in the present effort by capturing the uncertainty information via nuggets added to
the Kriging model.
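
The effect of a nugget can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the model built in this effort: it uses zero-mean kriging with unit process variance and a fixed length scale, a hypothetical `truth` function, and a deterministic alternating `scatter` standing in for noise such as incomplete solver convergence. The nugget is simply a variance added to the diagonal of the correlation matrix, one value per training case.

```python
import numpy as np

def corr(a, b, length=0.2):
    """Squared-exponential correlation between two sets of 1-D points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def krige_with_nugget(x, y, x_new, nugget):
    """Zero-mean kriging mean prediction with a nugget on the diagonal.

    `nugget` may be a scalar or one variance per training case.
    """
    K = corr(x, x) + np.diag(np.broadcast_to(nugget, x.shape))
    return corr(x_new, x) @ np.linalg.solve(K, y)

def truth(x):
    """Hypothetical noise-free response."""
    return np.sin(2.0 * np.pi * x)

x = np.linspace(0.0, 1.0, 11)
scatter = 0.2 * (-1.0) ** np.arange(x.size)    # deterministic stand-in for noise
y = truth(x) + scatter
x_test = np.linspace(0.0, 1.0, 101)

def rmse(pred):
    """Root-mean-square error against the noise-free response."""
    return float(np.sqrt(np.mean((pred - truth(x_test)) ** 2)))

# A near-zero nugget forces the model through every noisy case;
# a nugget equal to the noise variance lets it smooth the scatter.
rmse_interp = rmse(krige_with_nugget(x, y, x_test, 1e-10))
rmse_nugget = rmse(krige_with_nugget(x, y, x_test, scatter ** 2))
```

Because the nugget is supplied per case, cases with larger known uncertainty can be given less influence than well-converged ones.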

By  combining  these  techniques,  surrogate  models  could  be  created  which  exhibited  better 
predictive  accuracy  while  selecting  the  most  informative  experiments  possible.  This 
significantly  reduced  the  computational  effort  expended  compared  to  a  more  standard  approach 
using  space-filling  samples  and  data  from  a  single  source.  The  relative  contributions  of  each 
technique  were  identified,  and  observations  were  made  pertaining  to  the  most  effective  way  to 
apply the separate and combined methods.

2 Background and Scope

The  Air  Force  has  looked  at  a  reusable  booster  stage  (RBS)  with  an  expendable  upper  stage  as 
a  more  cost  effective  and  responsive  replacement  for  the  current  expendable  launch  vehicles.  In 
order  to  meet  Milestone  B  criteria  to  develop  this  new  launch  capability,  AFRL/RQ  has  worked 
many  different  technology  areas  to  develop  an  RBS  including  plans  for  flight  demonstrators. 

One of the fundamental tradeoffs for an RBS is the method by which the reusable booster
performs its return to launch site (RTLS) trajectory. Previous trade studies have examined
glideback,  jetback,  and  rocketback  trajectories.  These  studies  showed  that  the  rocketback 
trajectory,  characterized  by  using  the  vehicle’s  main  propulsion  system  to  impart  enough 
velocity  to  return  to  the  launch  site,  provides  operational  advantages  when  compared  to 
glideback  or  jetback  trajectories. 

The  Georgia  Institute  of  Technology  School  of  Aerospace  Engineering’s  Aerospace  Systems 
Design  Laboratory  (ASDL),  directed  by  Professor  Dimitri  N.  Mavris,  specializes  in  the 
advancement  of  systems  engineering  methods  and  the  emphasis  on  Early-Phase  Systems 
Engineering,  during  which  the  most  fundamental  (and  most  difficult  to  reverse)  decisions  are 
made.  Best-in-class  guidance  derives  an  approach  that  leverages  parametric  design  techniques 
to  aid  informed  decision-making  under  uncertainty  for  system  design.  Key  elements  of  this 
approach  include:  design-of-experiments,  surrogate  models,  probabilistic  analysis,  and 
parametric  visualization  tools  to  support  electronic  design  reviews.  In  addition,  an  important 
enabler  utilized  in  Task  C  was  the  Department  of  Defense  (DoD)  High  Performance 
Computing Clusters (HPCC). Typically, computing resources limit what can be accomplished.
Without the HPCC, applying advanced design methods to a study of this
magnitude would have been infeasible. Other enablers for this study were the robust and
automated  aerodynamic  analysis  and  the  Pacelab  Suite  geometry-centric  systems  engineering 
modeling  environment. 

The  objective  of  the  task  order  is  to  further  explore  the  RBS  aeromechanic  and  vehicle 
configuration  tradespace  in  order  to  reduce  the  risks  inherent  in  this  new  design.  The  research 
was  conducted  with  three  tasks  known  from  here  on  as: 

•  Task A: Parametric Trajectory/Vehicle Design Study (Performed by Georgia Tech
Aerospace Systems Design Laboratory)

•  Task B: Probabilistic Methods for Design Margin (Performed by Georgia Tech
Aerospace Systems Design Laboratory)

•  Task C: Creation of Multi-Fidelity Aerodynamic Analysis Tool (Performed by the
Georgia Tech Aerospace Systems Design Laboratory)

Task A, a parametric study of rocketback trajectories, involved varying many different
vehicle/trajectory design options and comparing “over the top” versus “underneath” RTLS
simulations. Task B consisted primarily of developing a visualization environment for the
trajectories developed in Task A, as well as exploring potential Figures of Merit for
rocketback demonstrators. Task C developed an integrated packaging and
trim analysis tool that can be used for exploring a large RBS tradespace, including
demonstrator-sized vehicles. Tasks A and C were larger in effort than Task B.

The following sections describe the research and results for Task C. The Appendices contain
more detailed information on the models and results, as well as software documentation for the
two MATLAB environments that were delivered to AFRL.

3 Background Information

The Air Force has looked at an RBS with an expendable upper stage as a more cost effective and
responsive  replacement  for  the  current  expendable  launch  vehicles.  In  order  to  meet  Milestone 
B  criteria  to  develop  this  new  launch  capability,  AFRL/RQ  has  worked  many  different 
technology  areas  to  develop  an  RBS  including  plans  for  flight  demonstrators. 

The rocketback RTLS maneuver is a major technology contributor to developing a cost
effective RBS. Since there are no concrete operational vehicle designs, there is a large
potential tradespace as to how this rocketback trajectory will be executed in an operational
system. For this activity, a baseline reference RBS operational system was designed. Then the
vehicle closure analysis process used for the nominal RBS was repeated for a variety of RBS
designs. This parametric study of rocketback trajectories involved varying many different
vehicle/trajectory design options such as, but not limited to: staging flight conditions, engine
throttleability, engine restart, wing loading, aerodynamics, rotation rate and direction, and
trajectory simulation control methods. Impacts to the operational RBS such as payload to orbit,
stage weight, aeroheating, loads, and engine requirements were quantified. Additionally,
several point design RBS vehicles were sized. These point designs looked at a 5 klb payload
class solution, an optimizer-controlled booster engine throttle setting simulation, and two
end-of-life mission variants. Also, a comparison of “Over the top” versus “Underneath” rocketback
RTLS trajectory simulations was conducted. Finally, relevant conclusions were drawn
regarding which trajectory and design parameters are most critical to the operational RBS
metrics of interest.

3.1 Introduction

As an aerospace vehicle design project progresses, a variety of decisions must be made which
range from the very large-scale - should the vehicle be fixed-wing, rotary-wing, or
lighter-than-air - to the very small-scale - where should each rivet be placed? In order to make these
decisions, designers rely on information from a variety of sources such as experience, intuition,
direct experimentation, and simulation. All this information serves one ultimate purpose:
aiding the decision-maker by revealing the consequences of his or her decisions. Effective
decision-making requires accurate knowledge of how each decision will affect the overall
vehicle and its performance. Without accurate knowledge of these consequences, a
decision-maker risks making bad choices which can lead to poor vehicle performance, program delays,
or complete project failure. Care should be taken to ensure that the consequences of a decision
are accurately understood before a choice is made.


3.2  Phases  of  the  Design  Process 

As  the  design  process  moves  forward,  the  vehicle  in  question  is  progressively  defined  and 
refined.  In  aerospace  applications,  the  stages  of  the  design  process  are  typically  referred  to  as 
Conceptual,  Preliminary,  and  Detailed  Design. [162] 

During conceptual design, the widest possible design space is explored. Extremely rapid
analysis techniques are used to evaluate as many options as possible. An analysis is any
approach that quantifies performance given a set of input values, and may be as simple as an
equation or as complex as a full-scale flight test. In conceptual design, the analyses will mostly
be simplified models that can quickly estimate the performance of possible designs. The
objective of this phase is to identify one or more promising vehicle options. The degree of
concept definition at the end of this phase may vary from program to program: Li et al.[104]
fixed the planform of the wing (i.e., span, taper, sweep) by the end of conceptual design, leaving
only the airfoil undecided, while Hutchins et al.[84] carried all of these parameters into
preliminary design to perform further trade studies. During this phase, the geometry of the
vehicle is defined using a few parameters, typically fewer than ten or fifteen.[104] In general,
however,  conceptual  design  focuses  on  broad  definitions  such  as  initial  vehicle  sizing  and 
configuration  selection. 

During the next stage, preliminary design, the concept or concepts selected during
conceptual design are further refined using more accurate analysis methods. If analyses
capturing multidisciplinary effects such as aeroelasticity were not included during the
conceptual phase, they must be performed now. This phase emphasizes the use of
higher-fidelity models for more accurate estimation of concept performance, with an associated
increase in computational cost per analysis. The geometry of the vehicle is defined in greater
detail, increasing the number of shape parameters to a few dozen or as many as a few
hundred.[104] Most or all of the vehicle’s geometric shape is frozen by the end of this phase.[89]

Finally,  detailed  design  emphasizes  individual  components  of  the  vehicle  such  as  ribs  or  fuel 
tanks.  Each  component  is  sized  to  meet  its  expected  requirements  and  defined  at  the  level  of 
detail  required  for  manufacturing.  The  final,  most  highly  refined  estimates  of  weight  and 
vehicle  performance  are  made,  and  prototypes  are  constructed.  This  is  the  longest  phase  of  a 
design  project. 

Each  phase  of  the  design  process  is  marked  by  an  increase  in  the  available  information  about  the 
design.  Early  on,  very  rough  analyses  may  be  performed  with  only  a  few  details  about  the 
concept,  such  as  take-off  weight,  maximum  thrust,  and  wing  area.  Conversely,  by  the  end  of 
detailed  design  there  may  be  thousands  of  components,  each  defined  via  tens  or  hundreds  of 
parameters,  which  must  be  included  in  order  to  assess  the  performance  of  the  vehicle  as  a  whole. 
The  use  of  rapid,  simplified  models  in  the  early  phases  of  the  design  process  introduces  some 
element  of  risk.  If  the  models  do  not  capture  all  the  phenomena  that  significantly  affect 
vehicle  performance,  there  is  a  chance  that  later  on,  when  more  accurate  analyses  are 
conducted  (such  as  wind  tunnel  tests)  the  selected  vehicle  will  be  found  to  have  poor 
performance.  If  poor  performance  is  discovered,  the  design  must  be  modified  until  performance 
improves,  a  process  which  may  cause  overruns  of  schedule  or  budgetary  goals.  It  is  in  the 
designer’s  interest  to  address  the  risk  of  deficient  performance  when  making  decisions. 

This  research  effort  focuses  on  conceptual  and  preliminary  design,  when  decisions  are  made  to 
fix  parameters  at  particular  values  and  a  baseline  configuration  is  selected.  The  potential  for 
mistakes  introduced  by  low-fidelity  modeling  may  be  addressed  in  a  number  of  ways: 

•  Trade  studies  to  quantify  how  decisions  might  affect  predicted  performance; 

•  Repeating  experiments  at  one  level  of  fidelity  to  assess  the  uncertainty  of  predictions  at 
that  level  of  fidelity;  and, 

•  Repeating experiments at different levels of fidelity to assess the degree to which the
change in fidelity level affects the performance predictions.

“Level of fidelity” can have different meanings depending on the context, although in general it
can be taken to mean the degree to which the analysis matches reality. [8] When describing
physical experiments, an experiment that matches flight Mach number is of lower fidelity
than one that matches both Mach number and Reynolds number. Computational models
have a wider range of fidelity; some models assume that viscous effects are negligible, or that
only linear effects are significant.

3.3  Trade  Studies  and  Optimizations 

At each stage of the design process, the available design space is explored to determine how
decisions might affect vehicle performance. The design space is defined by the parameters
being investigated and the allowable ranges of those parameters. Exploring the design space may
take the form of trade studies or optimizations. Put briefly, optimization progressively (and in
some cases, automatically) modifies a design to maximize or minimize some user-selected
response function; in contrast, trade studies are used to generate data about various potential
designs which will be used to support decision-making.[41]

In both processes, multiple designs are analyzed to evaluate performance. The results are used
to learn how the design parameters affect the response(s) of interest. In an optimization, the
results will be converted into a score using an explicit quantitative objective function, defined
at the beginning of the process. One or more new candidates will then be generated based on
the observations and analyzed to determine objective function scores. This objective may be a
direct output of the analyses, such as vehicle weight or operating costs, or it may be a
combination of multiple responses, each given a particular weight to reflect the importance of
improvement in that response. The optimization process will be repeated until an optimum is
identified or the allotted resources have been expended. In essence, an optimization may be
considered to be a repeated trade study with a known, quantifiable objective.
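
The loop described above (analyze, score with a weighted objective, generate a new candidate, and repeat until an optimum is found or the budget is spent) can be sketched as follows. Everything specific in this Python example is hypothetical: the single design variable `span`, the two response formulas in `analyze`, the equal `WEIGHTS`, and the simple step-halving search all stand in for whatever analyses and optimizer a real study would use.

```python
def analyze(span):
    """Hypothetical analysis returning two responses for one design variable."""
    return {"weight": 100.0 + 2.0 * span ** 2, "drag": 400.0 / span}

WEIGHTS = {"weight": 1.0, "drag": 1.0}   # illustrative importance weights

def score(responses, weights=WEIGHTS):
    """Weighted-sum objective function: lower is better."""
    return sum(weights[name] * responses[name] for name in weights)

def optimize(start, step=1.0, budget=100, tol=1e-4):
    """Step-halving search that stops at convergence or when the budget is spent."""
    best_x, best_score = start, score(analyze(start))
    used = 1                                  # analyses performed so far
    while step > tol and used < budget:
        improved = False
        for trial in (best_x + step, best_x - step):
            if trial <= 0.0 or used >= budget:
                continue
            trial_score = score(analyze(trial))
            used += 1
            if trial_score < best_score:
                best_x, best_score, improved = trial, trial_score, True
                break
        if not improved:
            step *= 0.5                       # refine the search around the best design
    return best_x, best_score, used

best_span, best_value, n_analyses = optimize(start=2.0)
```

For these assumed responses the objective is 100 + 2s² + 400/s, whose minimum lies at s = 100^(1/3), so the search result can be checked by hand. Changing the weights changes the optimum, which is why the objective must be fixed before the process begins.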

In a trade study, this information may be used to support a decision directly, or used as the
foundation for another trade study to investigate any interesting behavior observed. The
System Engineering Manual for the Federal Aviation Administration’s National Airspace
System states that a trade study is used “to identify the most balanced technical solutions among
a set of proposed viable solutions.”[53] The Manual also states that use of trade studies
“prevents program/project management from committing too early to a design that may not be
cost effective or meets [sic] all system requirements.” These trade studies are performed at all
stages of the design process,[41, 42, 46] and are used to investigate the design space. By
investigating the results, the effects of various design parameters on response behavior may be
inferred. This allows the decision-maker to identify which parameters significantly affect the
responses of interest. If a parameter does not significantly affect any responses of interest, it
may be set to some reasonable default value and omitted from future trade studies, simplifying
the effort.

Decision-makers are faced with conflicting motivations when setting up a trade study. There is
incentive to include as many parameters as possible in a trade study so that every effect can be
investigated, including interactions between design variables. Hutchins et al.[84] cite “the
often strong coupling between the variables and the highly multidisciplinary nature of the design”
as motivation to include as many parameters as possible in an aero-structural tool for wing trade
studies. Furthermore, as the project moves forward some design parameters are frozen;[104]
the decision-maker must take care not to freeze a parameter before its effects are known with
confidence, or else risk costly backtracking if the frozen value is later found to be
detrimental.[84, 166] These factors provide the incentive to perform large trade studies with
many parameters.

Approved for public release; distribution unlimited

On  the  other  hand,  a  trade  study  seldom  can  include  every  variable  of  interest.  These  studies 
must  take  place  within  an  ongoing  design  effort,  and  thus  will  not  have  unlimited  resources.  As 
the  number  of  design  variables  in  a  study  increases,  the  simulation  effort  required  to  complete 
that  study  increases  very  rapidly,[72,  99]  an  effect  known  as  the  curse  of  dimensionality.  As  a 
result,  trade  studies  often  must  be  carefully  designed  to  ensure  they  can  be  completed  using 
the  experimental  resources  available,  and  without  taking  an  excessive  amount  of  time.  [89] 
Careful  selection  of  experiments  can  ease  such  constraints  to  some  degree,  but  they  cannot  be 
ignored  altogether.  Thus,  these  constraints  may  limit  the  analysis  tools  that  may  be  used  in  a 
given  trade  study. 
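
A short calculation makes the growth concrete. The Python sketch below counts the analyses in a full-factorial study, in which every variable is tested at every level in combination with all others. The choice of a full-factorial design and of three levels per variable is an assumption for illustration; space-filling designs grow more slowly, but the trend is the same.

```python
def full_factorial_runs(n_variables, levels=3):
    """Number of analyses in a full-factorial study with `levels` settings per variable."""
    return levels ** n_variables

for n in (2, 5, 10, 15):
    print(f"{n:2d} variables -> {full_factorial_runs(n):,} analyses")
```

Going from five to fifteen variables at three levels each raises the count from 243 to more than fourteen million analyses.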

3.4  Speed  and  Fidelity  of  Analyses 

Every  computational  analysis  tool  makes  certain  simplifying  assumptions  in  its  representation 
of  the  world,  even  if  only  through  spatial  and/or  temporal  discretization.  Jameson[89]  estimated 
that  modeling  the  airflow  around  a  typical  aircraft  with  discretization  fine  enough  to  capture 
boundary  layer  behavior  over  all  active  length  scales  would  require  on  the  order  of  ten  billion 
mesh  points  within  the  boundary  layer  alone.  Capturing  the  evolution  of  that  flow  field 
through  time  would  further  require  roughly  fifteen  thousand  time  steps  per  second.  This 
degree  of  resolution  would  put  modeling  well  out  of  reach  at  present.  Jameson  goes  on  to 
note,  however,  that  the  amount  of  information  produced  by  such  an  analysis  is  far  in  excess  of 
what  is  required  for  typical  engineering  efforts. 

Most  engineering  analysis  focuses  on  large-scale  responses,  such  as  the  total  drag  on  a  vehicle. 
If  precision  is  not  critical,  these  responses  can  be  estimated  using  lower-fidelity  methods 
which  make  simplifying  assumptions.  These  assumptions  usually  posit  that  certain  factors  have 
insignificant  effects  relative  to  the  precision  required,  and  thus  may  be  neglected.  Although 
these  simplified  methods  are  of  lower  fidelity,  they  can  be  very  useful  if  the  dominant  effects 
are  still  captured.  For  example,  the  Newtonian  theory  of  gravity  produces  a  very  good 
approximation  of  orbital  motion.  Based  on  observations  of  the  seven  known  planets,  the 
position  and  existence  of  the  planet  Neptune  was  predicted  in  1846  using  Newtonian  theory. 
However,  despite  astronomers’  best  efforts,  the  theory  could  not  account  for  the  observed 
precession  of  Mercury’s  orbit,  and  for  years  astronomers  searched  for  a  new  planet  to  explain 
the  inconsistencies.[191]  When  Einstein  published  his  theory  of  general  relativity,[101]  its 
ability  to  accurately  model  Mercury’s  orbit  was  a  major  achievement  and  a  strong  argument  in 
favor  of  the  new  theory.  Still,  the  Newtonian  theory  of  gravity  was  used  to  correctly  predict  the 
existence  and  position  of  Neptune,  demonstrating  that  a  model  may  be  useful  and  informative 
even  when  it  is  not  a  perfect  reproduction  of  reality. 

There exist a variety of levels of modeling fidelity for aerodynamics. [169] Among the most
accurate are computational fluid dynamics (CFD) simulations that capture, to varying degrees,
viscous effects. Different approaches make different simplifying assumptions to reduce the
large computational demands imposed by viscous flow interactions. These viscous models
range in complexity from Large Eddy Simulation, which constrains the lower limit of the length
scale of the viscous behaviors captured, to Reynolds-Averaged Navier-Stokes (RANS)
simulations which only attempt to model time-averaged behavior of the turbulent motion. Euler
flow models, the next level of simplification, neglect viscous effects but retain other non-linear
effects. Making the additional assumption that rotational flow effects are insignificant yields
the potential flow equations, which may be linearized for an even simpler model. Some models
intended to analyze fairly slow flight speeds may also neglect the effects of compressible flow.

The modeling approaches in the previous paragraph were developed from analytical descriptions
of fluid behavior using a series of assumptions as to which fluid behaviors would significantly
affect the desired outputs, such as the resulting lift of the vehicle. The loss of accuracy resulting
from each simplification depends on the response being modeled and the problem being
analyzed - assuming incompressible flow at hypersonic conditions will introduce more error
than the same assumption at Mach 0.1. Incompressible potential flow models may give a good
approximation of lift on a vehicle at slow speeds, but the vehicle drag will be vastly
underestimated.
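
One way to see why the incompressible assumption is harmless at Mach 0.1 and untenable at high speed is to compute how much the density of the air can change along a streamline. The Python sketch below evaluates the standard isentropic relation between stagnation and static density for a calorically perfect gas. The ratio of specific heats of 1.4 is an assumed value for air, and the perfect-gas model itself loses validity at hypersonic conditions, so the result is an indicator of density variation rather than an error estimate for any particular solver.

```python
def stagnation_density_ratio(mach, gamma=1.4):
    """Isentropic ratio of stagnation to static density for a perfect gas."""
    return (1.0 + 0.5 * (gamma - 1.0) * mach ** 2) ** (1.0 / (gamma - 1.0))

for mach in (0.1, 0.3, 0.8, 2.0, 5.0):
    change = stagnation_density_ratio(mach) - 1.0
    print(f"Mach {mach:3.1f}: density varies by up to {100.0 * change:.1f}%")
```

At Mach 0.1 the density changes by about half a percent, so treating it as constant costs little; at Mach 5 the ratio exceeds 80, and a constant-density model no longer resembles the flow.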

An alternative approach to estimating aerodynamic responses is to draw upon known flight
performance of existing aircraft. By its nature, this data captures all relevant fluid behaviors
because it is compiled from actual flight test data rather than simulations. With data from
enough aircraft, trends and patterns can be identified and correlated with design parameters. Tail
volume coefficients are an example of the application of such a trend. [162] Using these trends
and some aerodynamic theory, the performance of a new aircraft design can be estimated. [47]
This type of model is known as a semi-empirical or handbook method, and it can be very
powerful when applied to a concept similar to those used as references for the model.
However, the relations in the model are only indirectly based on flow physics, and if the model
is applied to a configuration that is not similar to those in the underlying data set its predictions
may not be accurate.
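
The tail volume coefficient mentioned above shows how such a trend is applied. The horizontal tail volume coefficient is conventionally defined as the tail area times the tail moment arm, divided by the wing area times the mean aerodynamic chord. A designer picks a coefficient typical of similar existing aircraft and solves for the tail area. In the Python sketch below, the function name and every numerical value are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def horizontal_tail_area(volume_coeff, wing_area, mean_chord, tail_arm):
    """Tail area implied by a target horizontal tail volume coefficient.

    volume_coeff = (tail_arm * tail_area) / (mean_chord * wing_area)
    """
    return volume_coeff * mean_chord * wing_area / tail_arm

# Hypothetical inputs: coefficient 0.9, 100 m^2 wing, 4 m chord, 15 m tail arm.
tail_area = horizontal_tail_area(0.9, wing_area=100.0, mean_chord=4.0, tail_arm=15.0)
print(f"Required horizontal tail area: {tail_area:.1f} m^2")
```

The calculation takes no account of the flow around the tail. Its validity rests entirely on the new design resembling the aircraft from which the coefficient was drawn, which is the limitation noted above.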

Most analysis tools will adequately capture simple responses such as normal force, but a
complex result like the center of pressure on a body will highlight any inaccuracies of
lower-fidelity tools. [129] The center of pressure must be predicted accurately if aerodynamic
moments are important. Handbook methods often predict centers of pressure that are not as
accurate as those from Euler simulations, which in turn are less precise than viscous predictions.
Unfortunately, the superior prediction accuracy of viscous calculations comes at a greatly
increased computational cost.

In general, the simpler the analysis, the faster it will execute. OVERFLOW[137] is a RANS
model that, depending on the complexity of the analysis and the amount of resources allocated,
can complete a viscous analysis of a configuration at one flight condition in perhaps a few
hours on a parallel computing cluster. Cart3D,[6] an Euler-level flow solver, can reach an
inviscid solution in roughly half an hour on a single computing node of eight cores. The
Unified Distributed Panel (UDP) program, the subsonic/low-supersonic portion of the
Aerodynamic Preliminary Analysis System (APAS),[181] is a potential flow solver and can
estimate the aerodynamic performance of an aircraft at one flight condition in a fraction of a
second on a desktop computer.

Furthermore, tools that capture more complex phenomena may require more information
before an analysis can be performed. Defining a configuration for APAS typically requires a
few dozen cross-sections; defining a configuration for Cart3D requires a full water-tight
surface mesh describing the entire vehicle, including joints between components. This requires
a more detailed description of the vehicle which may not be available early on in the design
process, particularly if the configuration is not defined in a parametric environment and changes
have to be made manually. As an added complication, vehicle performance may be affected by
geometric details as small as the blending between two wing sections, [65] or the doors covering
payload or landing gear bays. [183] Lee demonstrated that small-scale nacelle details can cause
shock waves strong enough to affect the entire upper surface of the wing, drowning out many
other effects of interest. [103] Using a high-fidelity tool and making unfortunate choices for
default settings may make it difficult or impossible to carry out a trade study effectively.

Oberkampf  and  Roy[138]  agree  that  it  may  be  inappropriate  to  use  the  highest  possible  level  of 
physics  modeling  for  every  computational  study.  A  highly  detailed  simulation  of  the  flow 
field  around  an  aircraft  can  be  extremely  computationally  intensive,  and  the  resulting  level  of 
solution  detail  may  vastly  exceed  what  is  necessary.  Such  in-depth  modeling  can  produce 
results  with  very  good  confidence  but  at  such  high  cost  that  the  approach  becomes  infeasible. 
Trade  studies  are  usually  conducted  as  part  of  a  larger  design  effort  and  therefore  must  be 
completed  within  a  budget  of  allocated  resources.  These  resources  may  take  the  form  of  time, 
manpower,  computational  effort,  financial  budget,  or  other  factors.  If  the  study  takes  too  long 
to  complete  it  may  be  ended  prematurely,  or  the  rest  of  the  design  process  may  be  delayed. 
The  program  managers  will  have  to  choose  between  moving  forward  with  insufficient 
information  to  support  their  decisions,  or  with  fewer  resources  available  for  later  work.  To  avoid 
these  negative  outcomes,  trade  studies  should  be  carefully  planned  to  balance  the  worth  of  the 
information  gained  against  the  cost  of  providing  that  information. 

One  key  technique  for  finding  this  balance  is  to  pay  attention  to  prediction  confidence.  The 
predictions  of  high-fidelity,  high-complexity  models  are  accompanied  by  tight  confidence 
intervals.  Those  tight  confidence  intervals  indicate  that  the  true  value  of  the  quantity  being 
simulated,  such  as  lift  or  drag,  is  likely  to  be  very  close  to  the  predicted  value.  However,  such 
confidence  intervals  usually  come  at  the  expense  of  time,  manpower,  and  computational  effort. 

As  models  are  simplified,  prediction  confidence  may  be  reduced  and  the  confidence  intervals 
grow  larger.  When  a  designer  defines  the  prediction  confidence  required  for  a  study,  he  or  she 
thus  constrains  the  minimum  level  of  model  complexity  that  can  still  meet  the  objectives  of  the 
study. 
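
The relationship between a confidence requirement and the minimum acceptable model complexity can be sketched as a simple selection rule: among the tools whose confidence interval is tight enough, choose the cheapest. The tool names, interval half-widths, and costs below are hypothetical placeholders, not validation results, and the single scalar half-width is an assumed simplification of how prediction confidence would be characterized in practice.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tool:
    name: str
    ci_half_width: float  # assumed half-width of the confidence interval on the response
    hours_per_run: float  # assumed cost of one analysis


# Hypothetical tools; the numbers are placeholders only.
TOOLS = [
    Tool("potential flow", ci_half_width=0.020, hours_per_run=0.001),
    Tool("Euler", ci_half_width=0.008, hours_per_run=0.5),
    Tool("viscous", ci_half_width=0.003, hours_per_run=3.0),
]


def cheapest_adequate_tool(
    tools: Sequence[Tool], required_half_width: float
) -> Optional[Tool]:
    """Return the least expensive tool whose interval is tight enough, or None."""
    adequate = [t for t in tools if t.ci_half_width <= required_half_width]
    return min(adequate, key=lambda t: t.hours_per_run, default=None)
```

Tightening `required_half_width` moves the selection toward more expensive tools, and a sufficiently strict requirement returns `None`, meaning that no available tool satisfies it.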

These  confidence  interval  requirements  will  vary  as  the  design  process  moves  forward.  Early 
studies  have  relatively  broad  confidence  intervals  because  they  describe  the  aircraft  using 
only  a  handful  of  major  design  parameters  such  as  wing  reference  area.  Many  other  aspects  of 
the  vehicle,  such  as  the  precise  shape  of  the  fuselage,  are  undefined  or  roughly  approximated  at 
that  stage.  Using  the  results  of  such  a  study,  desirable  values  for  this  first  set  of  parameters  may 
be  identified.  Later  studies  will  refine  the  vehicle  concept  by  incorporating  more  parameters, 
such  as  airfoil  selection  and  control  surface  sizes.  This  cycle  is  repeated  until  the  entire  vehicle 
is defined to a degree that manufacturing may begin. Each trade study must be carefully
planned according to its objectives and resources.

3.5 Planning a Trade Study

As described in Section 1.2, the simulation effort for a study increases rapidly as more
parameters are included in the study.[72] This provides the motivation to incorporate only the
parameters which are likely to significantly affect the results. To quantitatively identify
important parameters out of a pool of candidates, screening tests may be performed.[193]
Screening tests can be used to identify the parameters that most affect the behavior of the
response, and the magnitude of their effect. They may also identify important second-order
interactions between parameters, although this typically requires more testing than a screening
test for first-order effects only. These screening tests are performed by running the
computational model for particular sets of input values and investigating the results with
statistical techniques. A screening test requires some investment of effort, but can identify
parameters that may safely be omitted from trade studies, reducing the overall effort required
without sacrificing the quantity or quality of information produced.
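
As an illustration of the idea, the sketch below estimates first-order (main) effects from a two-level full factorial design, in which every parameter is coded as -1 (low) or +1 (high). This is only one simple way to screen parameters and is not necessarily the design used in the cited work; practical screening studies often use sparser fractional designs. The function `toy_model` is a hypothetical stand-in for a computational model.

```python
from itertools import product


def main_effects(model, n_params):
    """Estimate main effects from a two-level full factorial design.

    The effect of parameter i is the mean response with parameter i high
    minus the mean response with parameter i low.
    """
    runs = list(product((-1, 1), repeat=n_params))
    responses = [model(x) for x in runs]
    effects = []
    for i in range(n_params):
        high = [y for x, y in zip(runs, responses) if x[i] == 1]
        low = [y for x, y in zip(runs, responses) if x[i] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects


def toy_model(x):
    # Hypothetical response: x[0] dominates, x[1] is moderate, x[2] is
    # negligible, and x[0] and x[1] interact weakly.
    return 5.0 * x[0] + 1.0 * x[1] + 0.01 * x[2] + 0.5 * x[0] * x[1]


if __name__ == "__main__":
    effects = main_effects(toy_model, 3)
    ranked = sorted(range(3), key=lambda i: abs(effects[i]), reverse=True)
    for i in ranked:
        print(f"parameter {i}: main effect {effects[i]:+.3f}")
```

In this toy case the third parameter has an effect several hundred times smaller than the first, so it would be a candidate for omission from later studies.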

In  addition  to  screening  tests,  the  set  of  active  parameters  in  a  study  may  be  reduced  by 
eliminating  variables  that  have  previously  been  investigated.  For  example,  after  a  conceptual 
design  study  has  identified  a  particular  value  of  wing  sweep  angle  as  being  most  beneficial  to 
vehicle  performance,  that  value  may  be  used  as  the  default  for  later  studies.  A  design  with 
multiple  parameters  that  have  been  fixed  may  be  considered  the  baseline  configuration  for  future 
studies.  This  simplifies  those  future  studies  by  reducing  the  number  of  free  parameters.  When 
Boeing  was  designing  the  777,  the  external  shape  of  the  vehicle  was  fixed,  or  “frozen,”  during 
preliminary  design.  Subsequent  studies  only  investigated  trades  that  would  not  affect  this  outer 
mold  line. [89] 

Although  defaulting  and/or  screening  out  variables  will  reduce  the  expense  of  a  given  study,  it 
will  not  eliminate  the  pressure  to  execute  the  study  within  the  time  and  resources  allotted  by 
the  project  schedule.  The  trade  study  designer  is  still  responsible  for  balancing  the  information 
gained against the costs of generating that information.[89] The Pegasus booster, developed
in the late 1980s, relied entirely on computer simulations using a variety of analysis
tools.[131] Due to the limitations of the processing capabilities available at the time, almost all
of the analysis was done using potential flow solvers and impact methods. Euler- and
RANS-level models were applied only a handful of times, principally for confirmation or
correction of the lower-fidelity tools. In particular, the more-accurate models were used to
investigate the possibility of plume-induced flow separation and interactions between shock
waves and boundary layers, effects that could not be captured using simpler tools. For the most
part, full-scale flight testing confirmed the performance predictions.

This example demonstrates more than one important concept. First, flow behaviors that were,
or might be, important were identified by the team in advance. Tests were done using tools
of appropriate fidelity to determine whether those flow behaviors would significantly affect the
vehicle’s performance. Second, these high-fidelity tools were applied intelligently, with
emphasis on conditions which would be the most difficult for lower-fidelity tools to capture
accurately, such as high angle-of-attack flight. This aided the team in estimating an upper bound
for error in the lower-fidelity performance predictions. Finally, the modeling tools for the
primary effort were selected based on the flow phenomena that were expected to be
significant as well as the resources available for the effort. Because complex phenomena
were not found to be significant, lower-fidelity tools could be applied safely without
introducing much error. These tools executed more quickly, allowing a larger number of
analyses to be performed over the flight conditions of interest. Ordinarily, such a study would
use wind tunnel testing to substantiate the predictions of computational models. In the case of
the Pegasus booster program, no commercial wind tunnel time was available for nearly a year, a
delay that was incompatible with the project schedule.[129] The project team instead applied
multiple independent analysis tools across the trajectory space of interest in order to estimate the
confidence in each prediction.

When setting up a trade study, designers must account for multiple factors: the number of
parameters to be investigated, the amount of resources available, and the level of fidelity of
the analysis or analyses that will be used. Using simpler analyses may greatly reduce the
resource cost per experiment, but designers must take care to ensure that the analyses will
capture all relevant effects and behaviors. Using inadequate tools may expose the project to
significant risks.

3.6  The  Cost  of  Inadequate  Analysis  Fidelity 

Unless the significant phenomena are identified, it may be impossible to determine the amount
of error or uncertainty introduced by simplifying assumptions. A design effort lacking this
information is at risk of selecting a concept with deficient performance; this deficiency will
then go unrecognized until higher-fidelity tools are applied later in the project, potentially
leading to backtracking and/or repetition of previous studies. Dorsey et al.[46] described a
sizing trade study for a reusable launch vehicle that captured propellant tank and intertank
structure, including detailed tank shape and arrangement parameters. Prior parametric weight
estimation tools did not capture these factors, and “[a]s a result major perturbations have been
made to the vehicle structure” which led to the “erroneous result of no apparent impact on the
vehicle total weight was obtained.” When the tank and intertank parameters were investigated
with more accurate modeling tools, it was observed that structural weight was affected quite
strongly by design parameters such as vehicle half body angle and payload carriage location.
For example, moving the oxygen tanks from the rear to the front of a vehicle changed the
structural loads, causing an 18 percent increase in the required structural mass; this effect was
not captured using lower-fidelity tools, and decisions which would significantly affect the
performance of the vehicle might have been made without accurately understanding the
consequences.

Other examples of the hazards of insufficient model fidelity may be found in the literature.
Aeroelasticity is the complex interaction between aerodynamic loads on a body, its structural
response, and the effect of that response on the body’s aerodynamics. Capturing this behavior
requires knowledge of the local aerodynamic loads on vehicle components as well as a
representation of component structures. The detailed information required, along with the
feedback loop inherent in the phenomenon, may preclude this analysis until late in the
preliminary design phase when much of the proposed vehicle has been defined; this decision
carries with it the unstated assumption that aeroelastic effects will not significantly influence
the design. Werner-Westphal et al.[192] demonstrated that for rearward-swept wings, neglecting
aeroelastic effects may be a conservative assumption; that is, aeroelastic effects reduce wingtip
loads and by extension wing structural requirements. However, this effect is reversed for vehicles
with forward-swept wings: for such vehicles, aeroelastic effects increase wingtip loads, and
accounting for the higher loads may drive up wing structural weight of the example aircraft by
roughly eight percent.


The incentive for high-fidelity modeling is not only an issue for multi-disciplinary analyses:
Collier et al.[35] investigated the degree to which different structural requirements drove wing
box structure weight. When damage initiation and damage tolerance (i.e., crack-growth)
constraints were taken into account, a design that previously appeared sufficient was found to
have negative margin. Modifying the design to satisfy the violated constraints resulted in an
increase of the required structural mass of the wing by 12 percent. Unless these modifications
can be identified and executed relatively early in the design process, the designers may be
forced to repeat many analyses to understand the effects of these changes on the performance of
the overall vehicle. The designers then face a choice between a dual penalty of lost time
and increased modeling effort, or a loss of knowledge about the vehicle.

It bears mentioning that the risk introduced by insufficient modeling fidelity is not solely a
problem for computer models. The Lockheed C-141 Starlifter was designed in the early 1960s
with experimental support from wind tunnel testing. When full-scale flight testing began, the
handling characteristics were significantly worse than predicted. Experiments identified the
source of the discrepancy: although some shock-induced flow separation was expected, the
flow separation at flight Reynolds number differed from what was observed in the wind
tunnel experiments by as much as 20 percent of the local wing chord.[23] Furthermore, it was
found that, given the knowledge at the time, only simulations at the full flight Reynolds
number would have adequately predicted the Starlifter’s in-flight behavior.[17]

The Space Shuttle development program provides another example: after the first flight test, it
was found that hypersonic trim at high angle of attack required a significantly larger deflection
of the body flap than was planned. Although the difference in pitching moment coefficient was
only 0.03, correcting the discrepancy required 16° of deflection rather than 7°, leaving less than
one-third of the expected control margin for maneuvering or controlling dispersions. Later
testing indicated that real gas effects, which were not investigated before the flight, accounted
for the majority of the discrepancy.[86, 133]

Any study conducted with inadequate fidelity may lead to a concept which will later be revealed
as infeasible or deficient. These shortfalls may significantly impact the design effort. Given
that such deficiencies would only be identified by higher-fidelity modeling, they will not be
captured until later, possibly much later, in the design process. Once they are discovered,
reactionary changes can begin, but the impact of these changes on the state of the design may be
very large if more than a few analyses have been performed using the old design. It may be less
costly, then, to ensure that the analysis tools are adequate for their purpose in the study; this
may be difficult to achieve without a separate investment of effort.


3.7 Determining Sufficient Fidelity

Researchers have identified a variety of techniques for assessing the ability of computational
tools to support a particular analysis. This process is known as validation. The AIAA Guide
for the Verification and Validation of Computational Fluid Dynamics Simulations[8] defines
validation as “the process of determining the degree to which a model is an accurate
representation of the real world from the perspective of the intended uses of the model.” This
definition emphasizes that, while validation does compare model predictions to the “real world,”
it is important to focus on the intended uses of the model. Put another way, a previously validated
tool may need to be re-validated if it is to be applied to a substantially different problem, and the
similarity of validation tests to the planned application of the tool is critical.

One  approach  to  tool  validation,  used  in  the  development  of  the  Pegasus  booster,  [129]  is  to 
repeatedly  analyze  a  vehicle  at  some  important  flight  conditions,  such  as  cruise  and  landing, 
using  different  computational  tools.  These  tools  should  be  independent;  any  common  code 
shared  between  tools  will  make  it  more  difficult  to  identify  inaccuracies  in  the  results.  The 
selection  of  tools  should  represent  not  just  various  analysis  programs,  but  different  levels  of 
fidelity  in  order  to  capture  the  errors  introduced  by  simplifications.  The  higher  the  fidelity  of 
the  tools  included  in  the  effort,  the  greater  the  likelihood  that  the  results  reflect  reality,  although 
this  cannot  be  guaranteed  without  recourse  to  physical  experiments. 

This  method  must  not  be  applied  recklessly.  A  group  of  handbook  methods,  applied  to  the  same 
unconventional  vehicle,  may  all  produce  similar  results;  this  should  be  taken  as  an  indication 
that  there  is  not  a  substantial  difference  in  accuracy  between  those  codes,  but  not  an  indication 
that  the  codes  all  produce  accurate  results.  Accuracy  can  only  be  determined  by  comparing  the 
handbook  estimates  against  higher-fidelity  models.  When  an  empirical  method  is  applied  to 
configurations  and  flight  conditions  similar  to  those  used  to  build  the  tool,  it  may  be  highly 
accurate.  This  pedigree  is  lost,  however,  when  the  method  is  applied  to  unconventional  designs 
or  unusual  flight  conditions. 

To  better  capture  the  prediction  uncertainty,  multiple  levels  of  fidelity  should  be  used.  The 
tools  should  be  applied  to  a  problem  as  similar  as  possible  to  the  topic  of  the  planned  trade  study. 
It  does  little  good  to  verify  a  code’s  accuracy  when  applied  to  a  subsonic  large  transport  if  the 
goal  is  an  accurate  performance  estimate  of  a  hypersonic  missile. 

Validation  experts[138,  141]  prefer  to  validate  analysis  tools  against  experimental  data.  Data 
from  wind  tunnel  experiments  (or  better  yet,  flight  tests)  will  naturally  capture  most  or  all 
relevant  phenomena.  This  reduces  the  risk  of  neglecting  important  effects,  creating  greater 
confidence when a computational tool matches experimental results well. In a sense, full-scale
physical experiments may be considered the highest-fidelity analysis. This fidelity comes at the
expense of data availability.

Readily available validation data may come from institutional memory, or it may be in the
public domain: NATO’s Advisory Group for Aerospace Research and Development (AGARD)
has published sets of data intended for use in validating computational tools.[11, 188]
Unfortunately, this data is necessarily of limited scope. The AGARD reports include only one
set of experimental data to validate full-vehicle supersonic analyses: a combat aircraft research
model with a forward-swept wing and a canard. Unless this matches the expected
application of the code, it may be difficult to accurately assess prediction uncertainties.

When the objective is to model a vehicle that is similar to those for which data is available,
tool validation is relatively straightforward; for revolutionary designs, it may be difficult or
impossible to find pre-existing experimental data for tool validation.

The problem becomes more complicated when experimental uncertainty is included. A wide
variety of uncertainties are present in any experimental data, such as the precise wind speed, the
exact angle of the model, and the accuracy of the sensors used. Behavior such as hysteresis
may also affect the measurements at each condition depending on the prior experiments run.[12]
This information should be included with the nominal data set for accurate uncertainty
estimation, but most published data sets neglect to include some or all of it. The cited
AGARD reports are exemplary in the level of uncertainty documentation, but those reports
were explicitly designed to be used for validation; it is rare to find that degree of documentation
in ordinary experimental reports. For this reason, even if experiments were performed using
configurations and conditions similar to those desired, it may still be difficult to accurately
estimate tool prediction accuracy without documentation that is unusually thorough.

The  final  option  for  tool  validation  is  the  use  of  ad  hoc  experimental  data.  Wind  tunnel  or 
flight  test  experiments  may  be  designed  and  executed  with  the  objective  of  assessing  tool 
prediction  accuracy  for  one  or  more  scenarios.  Although  this  produces  high-quality  data  that  is 
directly  relevant  to  the  problem(s)  of  interest,  it  is  also  the  most  expensive  tactic.  Physical 
models  must  be  built  and  instrumented,  time  in  wind  tunnels  reserved,  and  data  taken.  Each 
data  point  may  need  to  be  sampled  repeatedly  in  order  to  quantify  experimental  variation  and 
path-dependent effects. Raymer estimates the cost of wind tunnel experiments at “several
hundred thousand dollars per model.”[161] To its credit, this option does allow the user to
specify  which  uncertainties  should  be  quantified,  and  allows  the  user  to  guarantee  that  the 
configuration  and  flight  conditions  are  relevant  to  the  intended  application  of  the  tools.  If 
time  and  resources  are  available,  this  is  the  approach  that  will  best  characterize  the  prediction 
confidence  of  the  modeling  codes  for  the  scenarios  of  interest.  Not  every  program  has  the 
resources  to  perform  such  experiments.  Mendenhall[129]  stated  that  the  Pegasus  booster 
program  was  executed  without  validation  from  physical  experiments,  not  by  preference,  but 
due  to  a  constrained  program  schedule  and  the  unavailability  of  wind  tunnels. 

Once  validation  data  and  all  tool  predictions  are  in  hand,  comparisons  can  be  made.  The  breadth 
of  the  predictions  will  give  the  user  some  idea  of  the  overall  prediction  uncertainty.  If 
predictions  are  scattered  widely,  the  simplifying  assumptions  of  the  models  significantly  affect 
the  calculations,  and  care  should  be  taken  to  select  the  appropriate  tool.  If  all  predictions  agree 
closely,  the  choice  of  tool  is  not  likely  to  introduce  significant  error  to  the  results.  This 
procedure  must  be  repeated  at  all  flight  conditions  of  interest  until  the  dependability  of  each 
model  at  every  relevant  flight  condition  has  been  established.  This  produces  an  estimate  of 
prediction  uncertainty  for  each  tool  when  applied  to  each  scenario  of  interest,  i.e.,  each 
combination  of  flight  condition  and  vehicle  configuration. 
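
The comparison described above can be sketched as follows: for each flight condition, the spread between the highest and lowest prediction is computed, and conditions where the spread exceeds a chosen tolerance are flagged as ones where the choice of tool matters. The drag coefficient values, tool names, and tolerance are hypothetical and serve only to show the bookkeeping; a real study would also compare against validation data where available.

```python
# Hypothetical drag-coefficient predictions: {flight condition: {tool: value}}.
CD_PREDICTIONS = {
    "Mach 0.6": {"panel": 0.0210, "Euler": 0.0214, "viscous": 0.0212},
    "Mach 1.2": {"panel": 0.0350, "Euler": 0.0440, "viscous": 0.0455},
}


def prediction_spread(predictions):
    """Return max-minus-min of the tool predictions at each flight condition."""
    return {
        condition: max(values.values()) - min(values.values())
        for condition, values in predictions.items()
    }


def conditions_needing_care(predictions, tolerance):
    """List the conditions where the tools disagree by more than the tolerance."""
    spread = prediction_spread(predictions)
    return [condition for condition, s in spread.items() if s > tolerance]


if __name__ == "__main__":
    # Assumed tolerance of 0.002 in drag coefficient.
    print(conditions_needing_care(CD_PREDICTIONS, tolerance=0.002))
```

In this made-up data set the tools agree closely at the subsonic condition but diverge at the supersonic one, so only the latter would call for careful tool selection.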

The  question  of  how  to  define  “acceptable”  prediction  uncertainty  is  an  important  one,  and  one 
which  varies  with  the  application.  If  the  pitching  moment  is  of  interest,  a  designer  might 
require  a  level  of  uncertainty  smaller  than  the  expected  control  surface  authority.  More  generally, 
acceptable  uncertainty  can  refer  to  a  level  of  prediction  confidence  sufficient  to  discriminate 
accurately  between  the  options  being  considered.  This  degree  of  discrimination  will  change 
as  the  design  process  evolves,  becoming  progressively  more  precise.  An  example  of  such  an 
acceptable uncertainty is a rule of thumb, such as: if the pitching moment coefficient for a
vehicle with undeflected control surfaces falls within ±0.1, it is likely that some set of
feasible control surface deflections can be found to negate that pitching moment.[24] This rule
of thumb allows designers to evaluate multiple configurations without having to tweak control
surface deflections to trim the vehicle for each flight condition being analyzed.
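
A rule of thumb of this kind is straightforward to apply as a filter over candidate configurations. The sketch below assumes the ±0.1 bound is inclusive and uses hypothetical configuration names and pitching moment coefficients.

```python
def likely_trimmable(cm_undeflected, limit=0.1):
    """Apply the rule of thumb: |Cm| with undeflected surfaces within the limit.

    Treating the bound as inclusive is an assumption of this sketch.
    """
    return abs(cm_undeflected) <= limit


# Hypothetical pitching moment coefficients with undeflected control surfaces.
CANDIDATES = {"config A": 0.04, "config B": -0.12, "config C": -0.09}

if __name__ == "__main__":
    keep = [name for name, cm in CANDIDATES.items() if likely_trimmable(cm)]
    print(keep)
```

Configurations that pass the filter are carried forward without a trim solution at every flight condition; those that fail would need closer examination before being retained.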

Modeling in the early stages of design may be somewhat imprecise without causing much
concern, because most design parameters have yet to be given values. As the concept is refined,
smaller-scale changes are investigated, and more precision is required to support decisions.

3.8  When  Requirements  Conflict 

Note  that  the  requirement  for  prediction  uncertainty  is  independent  of  the  computational  tools 
being  considered.  There  is  no  guarantee  that  any  tool  will  offer  the  required  confidence  at  a 
feasible  cost.  This  is  especially  true  if  the  objective  is  an  investigation  of  many  parameters  at 
once,  which  can  be  necessary  to  capture  interactions  between  parameters.  As  Raymer  says, 
“[t]he  choice  of  code  for  a  given  design  problem  depends  on  the  nature  of  the  problem  and  the 
available  budget  (and  not  always  in  that  order!).”[161]  What  can  be  done  if  tool  validation 
reveals  that  a  desired  trade  study  cannot  be  completed  without  violating  either  confidence 
requirements  or  resource  limits? 

The  designers  are  then  faced  with  a  conflict.  One  option  is  to  exceed  the  resources  allocated  to 
the  study  or  request  additional  support.  This  is  seldom  an  attractive  option,  and  may  not  be  a 
viable  choice  if  the  resource  limits  (whether  financial,  computational  or  temporal)  are  firm 
limits. 

Another  option  is  to  relax  the  prediction  confidence  constraints.  This  would  allow  simpler, 
faster  tools  to  be  used  in  the  trade  study,  and  may  reduce  the  resource  cost  enough  that  the 
study  could  be  completed  within  the  given  constraints.  On  the  negative  side,  this  introduces 
risk:  with  the  relaxed  confidence  requirement,  decision  makers  might  select  a  configuration 
that  is  predicted  to  safely  meet  performance  requirements,  but  in  fact  violates  them  instead.  This 
deficiency  will  not  be  recognized  until  a  more-accurate  analysis  is  done  sometime  after  the 
current  trade  study.  This  is  what  occurred  during  the  development  of  a  fly-back  booster  by 
DLR,  the  German  Aerospace  Center:  a  lower-fidelity  aerodynamics  tool  was  used  to  optimize 
the  design  of  a  canard.  Later,  when  higher-fidelity  modeling  was  performed,  it  was  found  that 
the  canard  shed  vortices  that  negatively  affected  wing  performance,  resulting  in  deficient 
performance.  [52] 

It  may  be  possible  to  improve  the  deficient  performance  using  the  remaining  unfixed  variables, 
although  this  suggests  that  the  performance  of  the  vehicle  could  have  been  even  better  if  the 
original  deficiency  were  not  present.  If  it  is  not  possible  to  ameliorate  the  design  problems, 
the  decision  makers  will  have  to  roll  back  the  design  process  to  address  the  source  of  the  trouble. 
Any  decisions  that  were  made  after  the  relevant  parameters  were  frozen  may  have  to  be 
revisited.  This  can  be  an  expensive  process  with  respect  to  both  effort  and  time.  When  DLR 
identified  poor  performance  in  their  fly-back  booster  design,  high-fidelity  analyses  -  in  the 
form  of  Euler  simulations  and  wind  tunnel  experiments  -  had  to  be  used  to  identify  the 
source  of  the  error,  which  was  found  to  stem  from  canard  tip  vortices  affecting  the  airflow 
over  the  main  wing.  These  analyses  were  costly,  but  simpler  models  demonstrably  did  not 
capture the flow phenomenon of interest. Once the canard tip vortices were identified as the
source of the discrepancy, the canard was modified to reduce this effect and the results were
confirmed using Euler CFD.[176]

If neither the resource limit nor the confidence requirement can be relaxed, the designer’s final
option  is  to  change  the  scope  of  the  trade  study.  This  may  take  the  form  of  reducing  the  ranges  of 
the  variables  being  investigated,  eliminating  some  variables  from  the  study  entirely,  or  both. 
Reducing  variable  range  simplifies  the  problem  somewhat  by  limiting  the  portions  of  the 
design  space  that  will  be  explored.  This  can  be  an  effective  technique  to  improve  efficiency, 
but  there  is  a  risk  that  interesting  portions  of  the  design  space  might  be  ignored.  Removing 
variables from consideration can be even more powerful: Section 1.2 described the Curse of
Dimensionality,  a  pithy  shorthand  for  the  observation  that  the  computational  cost  of  a  study 
increases  extremely  rapidly  as  the  number  of  parameters  grows.  Correspondingly,  the  cost  will 
decrease  rapidly  as  parameters  are  removed. 
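
The scale of this effect can be illustrated with a full factorial design, in which every combination of parameter levels is analyzed and the number of runs is the number of levels raised to the number of parameters. The full factorial assumption and the choice of three levels per parameter are made here purely for illustration; other experimental designs grow more slowly.

```python
def full_factorial_runs(levels, n_params):
    """Number of analyses in a full factorial design."""
    return levels ** n_params


if __name__ == "__main__":
    # Assumed three levels per parameter.
    for n_params in (4, 6, 8, 10):
        runs = full_factorial_runs(3, n_params)
        print(f"{n_params} parameters: {runs} runs")
```

With three levels, ten parameters require 59049 runs while eight require 6561, so removing just two parameters reduces the number of analyses by a factor of nine.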

In many engineering problems, most outputs are significantly affected by only a handful of
inputs, a phenomenon known as “effect sparsity” or the Pareto principle.[193] Although a
computer  model  may  have  dozens  of  input  parameters,  it  is  likely  that  most  only  have  minor 
effects  on  a  particular  output.  To  determine  which  parameters  are  important  to  a  given  response, 
screening  tests  may  be  performed.  Screening  tests  use  a  relatively  sparse  sampling  of  the 
design  space  to  identify  major  linear  and  nonlinear  effects,  including  any  interactions  between 
variables.  Thus,  a  screening  test  incurs  some  cost,  but  may  result  in  an  overall  savings  of 
resources  by  identifying  parameters  which  do  not  significantly  affect  the  results. 

For  some  problems,  this  principle  does  not  hold,  especially  if  many  responses  are  to  be 
investigated  simultaneously.  Although  each  response  depends  mostly  on  a  handful  of  inputs, 
there is no guarantee that every response will depend on the same handful of inputs. The more
responses that must be captured, the lower the chance that any given parameter has a
negligible effect on all responses. Furthermore, the parameters that significantly affect
aerodynamic responses will change with the speed and angle of flight: nose shape has a much
greater effect at supersonic conditions than it does for subsonic flight. Thus, as the number of
important flight conditions for a vehicle increases, the designer can expect a greater
number of geometric parameters to be significant for vehicle aerodynamics. In such situations,
designers will not easily be able to omit variables without omitting portions of the relevant
design space.

What  can  be  done  in  this  situation?  Constraints  on  resources  encourage  rapid  evaluations  or 
fewer  analyses,  while  constraints  on  prediction  confidence  drive  the  user  toward  methods  with 
high  confidence  but  slow  evaluation  speeds.  If  the  study  cannot  be  captured  in  a  small  number 
of  analyses,  the  designer  must  investigate  techniques  that  can  reduce  the  cost  of  performing  the 
trade  study.  An  example  of  such  a  scenario  is  investigated  in  detail  in  the  next  section. 

4 Motivating Example

The  conflicting  objectives  of  highly  accurate  data  and  affordable  analysis  costs  can  be 
illustrated  by  a  recent  research  effort.  The  Air  Force  Research  Laboratory  (AFRL)  sponsored 
research into modeling the aerodynamics of a reusable booster system (RBS). In particular, the
AFRL  expressed  interest  in  a  hybrid  system,  such  that  the  first  stage  was  reusable  while  the 
upper  stage  or  stages  would  be  expendable.  For  this  concept,  the  first  stage  would  lift  the 
upper  stages  to  a  particular  altitude,  orientation,  and  speed,  and  then  the  first  stage  would  be 
jettisoned.  The  upper  portions  of  the  launcher  would  continue  toward  orbit  while  the  first  stage 
would reverse its course and return to the original launch site. The maneuver by which the
vehicle  reversed  its  course  is  referred  to  as  a  return  to  launch  site  (RTLS)  maneuver.  [19,  71] 

Such  a  system  has  been  the  subject  of  research  by  various  other  entities,  including  NASA[144] 
and  DLR.[176]  Hellman[79]  gives  an  overview  of  some  studies  on  the  topic.  The  research 
sponsored  by  the  AFRL  focused  on  vehicles  which  employ  the  main  rocket  engines  to  offset 
the  downrange  velocity  of  the  booster  after  staging,  an  operational  concept  known  as 
“rocketback.”  After  the  engines  cease  firing,  the  vehicle  executes  a  gliding  return  to  the  launch 
site. 

Previous studies by Masse[118] and Sippel[176] highlight one of the major design challenges
for a winged reusable booster: stability and control. The booster must have sufficient control
authority  to  achieve  trim  along  the  entire  RTLS  trajectory.  The  gliding  portion  of  the  RTLS 
trajectory,  which  was  the  main  subject  of  this  effort,  can  include  flight  conditions  from  nearly 
hypersonic reentry speeds down to low-subsonic landing speeds, and angles of attack that stretch
from single digits up to between 30° and 40°. Identifying a vehicle that can trim at such a broad
variety  of  flight  conditions  is  a  demanding  design  task. 

Further  complicating  matters  is  the  complexity  of  the  aerodynamics  across  the  various  flight 
regimes.  Some  rapid  computer  models,  such  as  APAS,[181]  neglect  nonlinear  effects  which 
have  been  shown  to  be  important  for  reusable  boosters  by  Pamadi  et  al.[144]  In  addition,  APAS 
is known to be somewhat inaccurate when predicting pitching moment coefficients for
configurations with long fuselages or aft center-of-gravity locations,[143] both of which may
be  the  case  for  a  winged  reusable  booster  vehicle.  To  increase  confidence  in  the  analysis 
results,  Cart3D,[6]  an  Euler  CFD  model,  was  selected  as  a  good  balance  of  fidelity  and  speed. 
The  results  of  Pamadi  et  al.  demonstrated  that  Cart3D  could  match  the  available  wind  tunnel 
data  for  the  Langley  Glide-Back  Booster  fairly  well.  Eggers  demonstrated  that  TAU,  another 
Euler  solver,  showed  good  agreement  with  the  wind  tunnel  results  for  another  reusable  booster 
configuration  after  sting  and  Reynolds  number  effects  were  subtracted.  [50] 

These results indicate that Cart3D would have sufficient fidelity to capture the dominant effects
and  flow  behaviors.  Note  that,  as  an  inviscid  model,  Cart3D  would  not  capture  viscous  effects 
like  skin  friction  drag  or  boundary  layers.  This  was  deemed  an  acceptable  loss,  as  the  viscous 
modeling  required  to  capture  those  effects  would  increase  computational  costs  by  roughly 
another  order  of  magnitude.  Using  eight  processors,  Cart3D  can  typically  analyze  a  vehicle’s 
performance at a given flight condition in 30 to 60 minutes. Details about the way that
Cart3D was applied for this work, including default flow solver settings, are given in
Appendix B.

The objective of the RBS research project was to create surrogate models which captured the
aerodynamics  of  rocketback  vehicle  designs.  Those  surrogate  models  could  then  rapidly 
evaluate  the  performance  of  any  configuration  within  the  design  space.  The  surrogate  models 
would  be  functions  of  geometric  parameters,  such  as  the  fuselage  radius  or  wing  root  chord. 

A  parametric  vehicle  geometry  model  was  created  in  the  PaceLab  Engineering  Suite[142,  175] 
which  included  42  geometric  variables.  Certain  design  characteristics  were  considered  to  be 
fixed: the forward fuselage was cylindrical, transitioning smoothly to a flat bottom in the rear.
The  wing  would  be  mounted  at  or  near  the  rear  of  the  vehicle,  low  on  the  fuselage.  Rather 
than  a  dorsal  vertical  tail,  vertical  fins  would  be  located  on  the  wing  tips,  similar  to  the  X-20 
“Dyna-Soar”.[90]  For  a  more  detailed  explanation  of  the  geometry  model  and  the  meaning  of 
each  parameter,  see  Garmendia  et  al.[61] 

In  general,  model  parameters  were  selected  such  that  the  user  could  not  specify  an  impossible 
vehicle.  For  example,  if  the  independent  variables  included  the  leading  and  trailing  edge 
sweep  angles,  the  root  chord,  and  the  half-span  of  the  wing,  it  would  be  possible  to  select  a  set 
of  incompatible  parameter  values:  if  the  leading  edge  sweeps  aft  while  the  trailing  edge  sweeps 
forward,  the  two  edges  might  intersect  at  a  half-span  distance  smaller  than  what  was  specified. 
Thus,  if  that  parameterization  were  used,  individual  parameter  values  which  were  within  the 
allowed  ranges  could  be  combined  to  define  a  geometry  that  was  infeasible.  To  avoid  this,  the 
parameters  were  chosen  in  such  a  way  that  geometrically  infeasible  vehicles  were  not  possible. 
One  drawback  to  this  approach  is  that  vehicles  which  would  have  very  poor  aerodynamic, 
structural,  or  operational  characteristics  are  possible  within  the  design  space  -  for  example,  the 
vehicle  in  the  third  row,  third  column  of  Figure  1  might  be  expected  to  have  structural 
challenges,  and  could  be  at  risk  of  striking  the  wingtips  during  landing. 


Figure  1:  An  Illustration  of  the  Breadth  of  the  Design  Space 


The  aerodynamic  surrogate  models  would  relate  the  geometric  parameters  to  the  forces  and 
moments  acting  on  the  vehicle.  Because  the  dominant  flow  behaviors  would  change 
significantly  with  flight  condition,  it  was  decided  that  the  effort  would  initially  attempt  to  model 
the  aerodynamics  at  individual  flight  conditions  separately.  If  good  results  were  obtained,  further 
research  would  seek  to  expand  the  models  to  capture  the  effects  of  changing  flight  conditions,  as 
well. 

The vehicle model included seven control surfaces - a pair of elevons on each wing, a rudder
on  each  vertical  fin,  and  a  body  flap  at  the  rear  of  the  fuselage.  Each  control  surface  could  be 
deflected  independently  to  more  accurately  capture  the  control  interactions.  When  the 
deflection  angles  for  all  control  surfaces  were  combined  with  the  rest  of  the  geometric 
parameters,  the  resulting  design  space  had  forty-nine  dimensions.  Screening  tests  were 
performed  to  identify  any  parameters  which  did  not  significantly  affect  the  responses  of  interest. 

4.1  Screening  Tests 

A  second-order  screening  design  was  used  to  build  129  configurations  for  testing.  A  set  of  16 
flight  conditions  was  selected  as  representative  of  the  overall  envelope  of  the  vehicle:  Mach 
0.3, 0.8, 1.2 and 4.0; angle of attack (α) 10° and 30°; sideslip 0° and 5°. After simulating all geometries at
each  combination  of  flight  condition  parameter  values,  a  sensitivity  study  was  performed  using 
JMP  analysis  software.  [91] 

The  study  results  led  to  several  observations.  First,  the  Pareto  Principle,  which  states  that  a 
small  number  of  variables  will  often  account  for  a  large  majority  of  the  response  behavior,[193] 
was  not  the  behavior  observed  in  this  case.  Instead,  although  individual  responses  were  more 
strongly  affected  by  some  parameters  than  others,  enough  responses  were  strongly  affected  by 
different  parameters  that  almost  all  of  the  design  variables  would  have  to  be  retained  to  capture 
the  bulk  of  the  response  variation.  Secondly,  because  all  aerodynamic  forces  and  moments  were 
considered  important  to  capture,  it  was  observed  that  every  design  variable  contributed 
significantly  to  at  least  one  response  at  some  flight  condition.  Given  these  observations,  it  was 
decided that all 49 design variables would be included in the subsequent modeling efforts in
order to capture as much knowledge about the design space as possible.

4.2  Identifying  Similar  Studies 

Previous  studies  that  were  identified  during  initial  research  offered  little  guidance  when  a 
sampling  approach  was  being  selected.  The  Computerized  Environment  for  Aircraft  Synthesis 
and  Integrated  Optimisation  Methods  (CEASIOM)  is  a  conceptual  design  tool  intended  to 
improve  designers’  ability  to  capture  stability  &  control  responses  during  the  conceptual  phase  of 
design. The aerodynamics tool incorporates high-fidelity modeling, up to Euler or
Navier-Stokes, to increase confidence in the results. [63] Multiple levels of analysis fidelity and
adaptive  sampling  are  used  to  identify  the  minima  and  maxima  of  the  responses.  Note  that  this 
tool  was  intended  to  be  applied  to  a  single,  fixed  configuration  at  a  time,  producing  Euler-level 
aerodynamics  for  that  configuration  overnight.  Thus,  it  is  not  well-suited  for  the  large  number 
of  configurations  considered  during  design  space  exploration. 

Scharl & Mavris investigated the use of surrogate models for aerodynamic modeling of a
subsonic transport. [171] Surrogate models were trained to reproduce the simplified aerodynamic
model  HASC,  a  potential  flow  model  with  a  semi-empirical  vortex  lift  model,  to  enable  analysis 
of  stability  and  control.  This  would  allow  efficient  design  of  the  empennage  and  control 
surfaces  based  on  analysis  rather  than  historical  analogues.  The  design  space  included  21 
parameters,  such  as  Mach  number,  sideslip  angle,  angle  of  attack,  and  altitude.  3,500  random 
samples  were  used  for  training  and  validation  of  the  surrogate  models,  which  were  built  with 
artificial  neural  networks.  Although  this  effort  did  incorporate  parametric  geometry,  the 
sampling  techniques  used  were  somewhat  simplistic.  Additionally,  due  to  the  relative 
simplicity of the aerodynamic model, the results were not necessarily applicable to a study
using  higher-fidelity  tools. 

Masse  and  Wilhite[118]  investigated  the  design  of  a  reusable  first-stage  booster  similar  to  the 
concept behind the RBS study. The concept was described using 11 geometric parameters,
and  APAS[45]  was  used  to  model  the  aerodynamics.  The  aerodynamic  responses  were  sampled 
using  a  3-  or  4-level  full  factorial  scheme,  and  a  quadratic  response  surface  equation  was  used 
to  represent  the  results.  The  relatively  low  level  of  model  fidelity,  the  sparse  sampling  scheme, 
and  a  lack  of  reporting  regarding  the  accuracy  of  the  resulting  surrogate  models  limit  the 
similarity  to  the  project  at  hand. 

No  studies  attempting  to  model  high-fidelity  aerodynamics  as  a  function  of  many  geometric 
parameters  were  identified  during  the  literature  search.  Lacking  such  guidance,  experiments 
were  designed  to  assess  the  effectiveness  of  different  possible  sampling  plans. 

4.3  Preliminary  Experiments 
4.3.1  Identifying  Similar  Studies 

First, an I-Optimal Design of Experiments of 2,048 cases was constructed. The I-Optimal
design  distributes  samples  in  such  a  way  that,  when  a  surrogate  model  is  created  from  those 
samples,  the  average  predictive  variance  is  minimized.  [91]  Note  that  this  design  does  not  take  any 
of  the  sampling  results  into  account,  and  thus  is  an  a  priori  sampling  design.  This  DOE  tends 
to  emphasize  the  corners  and  edges  of  the  space,  which  serves  to  minimize  or  eliminate  the 
amount  of  design  space  for  which  the  surrogate  model  would  have  to  extrapolate. 

In  addition,  a  2,500  case  Latin  hypercube[126]  was  generated.  With  this  sampling  approach, 
the  user  selects  the  number  of  samples  to  be  performed,  N,  and  the  range  of  each  design 
variable  is  divided  into  N  equal  portions,  or  “bins”.  Samples  are  then  selected  so  that  no 
two  samples  fall  into  the  same  bin  for  any  input  variable,  which  effectively  creates  a  uniform 
sampling  over  each  variable.  The  process  of  selecting  these  samples  may  be  performed  in 
multiple  ways,  such  as  maximin  sampling  which  maximizes  the  minimum  distance  between  any 
two  sample  points. [132] 
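
The binning procedure described above can be sketched in a few lines. This is a minimal illustration that assumes a unit design space and purely random pairing of bins; it does not perform the maximin optimization mentioned in the text, and the function names are hypothetical rather than taken from the tools used in this study.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=None):
    """Return n_samples points in [0, 1)^n_dims forming a random Latin hypercube."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One sample per bin: shuffle the bin indices, then jitter within each bin.
        bins = list(range(n_samples))
        rng.shuffle(bins)
        columns.append([(b + rng.random()) / n_samples for b in bins])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def is_latin(points):
    """True if every variable has exactly one sample in each of the N bins."""
    n = len(points)
    return all(
        sorted(int(p[d] * n) for p in points) == list(range(n))
        for d in range(len(points[0]))
    )


samples = latin_hypercube(n_samples=10, n_dims=3, seed=0)
print(is_latin(samples))
```

Because the bins are paired at random, two draws with different seeds can differ noticeably in how evenly they fill the interior of the space, which is the motivation for criteria such as maximin.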

To evaluate the accuracy of surrogate models, an additional set of test points was generated.
These test points consisted of 2,000 random configurations. Each surrogate model would be
used to make predictions for the response values at each point; these predictions would then
be compared against the true values to determine which sampling & modeling approaches
would  be  most  effective. 

Both  designs  were  used  to  sample  the  design  space  at  two  flight  conditions.  One  condition 
represented a plausible landing scenario at Mach 0.3, α 10°. The other condition represented
an earlier phase in the gliding trajectory at Mach 3.0, α 30°. Both flight conditions had 0° of
sideslip.  All  6,000  vehicles  were  modeled  at  each  flight  condition  and  the  results  compiled. 

A  variety  of  surrogate  models  were  created  using  these  data  sets  in  order  to  identify  the  best 
approach. 

4.3.2 Surrogate Modeling Experiments

A number of different techniques exist for creating surrogate models. One of the most direct,
known as response surface methodology (RSM), uses polynomial equations to describe the
behavior of the response. [33] RSM can be very effective when applied to engineering problems,
as the responses of many physical systems can be captured with second-order models. RSM
models are commonly fitted using least squares methods to find the coefficients that best match
the observed data. [91] If the response is more complex, higher-order terms may be
incorporated, but this may be limited by the quantity of data available: there must be at least
as many data points as coefficients to be estimated. Thus, it may be difficult to obtain
sufficient data to fit a third- or fourth-order RSM model, especially if there are many design
variables. Stepwise regression, in which higher-order terms are added iteratively to the model
when they appear to be beneficial to model accuracy, was performed to investigate whether a
partial increase in model order would improve the fit of the RSM models. Even when considering
terms up to fifth order, model accuracy was still very poor.
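
To give a sense of scale for the data requirement, the number of coefficients in a full polynomial of a given order can be counted directly. The sketch below uses the standard combinatorial count for a full polynomial (all terms up to the stated order, including the intercept) and applies it to the 49 design variables of this study; the function name is illustrative only.

```python
from math import comb


def full_polynomial_terms(n_vars, order):
    """Number of coefficients in a full polynomial of the given order,
    counting the intercept and every cross term up to that order."""
    return comb(n_vars + order, order)


for order in (2, 3, 4):
    print(order, full_polynomial_terms(49, order))
```

For 49 variables, a full second-order model has 1,275 coefficients, while full third- and fourth-order models have 22,100 and 292,825, respectively. The higher-order counts exceed the few thousand samples available in the preliminary experiments, which is why only a partial increase in order through stepwise regression was practical.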

Another common modeling technique is Kriging. Kriging is a statistical modeling method that
represents the response behavior as a combination of some underlying mean function and a
zero-mean stochastic function which describes deviations from this mean. [170]
Kriging is an exact interpolator, which means that it will exactly reproduce the known response
values at the training points. [39] Fitting a Kriging model requires the inversion of a matrix of
dimensions n × n, where n is the number of training data points. As a result, fitting the model
typically involves a computational burden of order O(n³) and a memory burden of order O(n²),
which can become significant as n becomes large (on the order of a few thousand). [136] The
DACE toolbox for Matlab[107] was used in an attempt to fit Kriging models, but memory
requirements exceeded the available resources when applied to this problem. As a result,
Kriging was not used for this portion of the effort.

Artificial neural networks are a third technique for surrogate modeling that have been used for
aerospace applications.[171] Neural networks consist of one or more hidden layers and an
output layer. Each layer has a number of nodes, known as neurons, and each neuron has an
activation function which combines values from the previous layer with weighting terms.
Once the specific form of the activation function is chosen, the weighting terms are optimized
to best fit the training data. [91] Hornik demonstrated that neural networks with a single hidden
layer can act as universal approximators, reproducing any function to an arbitrary degree of
accuracy, for sufficiently large networks.[82] This ability to match any function is offset by
the problem of fitting the network: for n input values, each neuron in the hidden layer will
have n + 1 free terms, and for m neurons the output layer will have m + 1 free terms. Fitting a
network of m nodes to a problem with n input dimensions thus requires the optimization of
1 + m × (2 + n) weights, which can be time-consuming.
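
The weight count above can be checked by adding the two layers separately. The sketch below assumes a single hidden layer and a single output, as in the text; the choice of 20 hidden neurons is purely hypothetical and is not a value reported for this study.

```python
def single_hidden_layer_weights(n_inputs, n_hidden):
    """Free terms in a network with one hidden layer and one output neuron."""
    hidden_terms = n_hidden * (n_inputs + 1)  # n weights plus a bias per hidden neuron
    output_terms = n_hidden + 1               # m weights plus a bias at the output
    return hidden_terms + output_terms


n, m = 49, 20  # 49 design variables; 20 hidden neurons is an illustrative choice
print(single_hidden_layer_weights(n, m))
print(1 + m * (2 + n))  # closed form from the text, same value
```

The count grows linearly with both the number of inputs and the number of hidden neurons, so a 49-input network with a few dozen neurons already has on the order of a thousand weights to optimize.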

Each  of  these  methods  was  applied  to  the  initial  data  set  to  determine  the  accuracy  of  the 
resulting  model.  One  surprising  result  was  that  models  trained  using  the  I-Optimal  points 
tended  to  perform  worse  than  those  that  were  not.  A  rule  of  thumb  is  that  models  trained  using 
larger  data  sets  tend  to  be  more  accurate;  in  this  case,  models  trained  using  the  combination  of 
the Latin hypercube and I-Optimal data sets performed worse than models trained using only
hypercube points. Recall that the I-Optimal DoE tends to sample the edges and corners of the
design space; this result suggests that samples closer to the middle of the space provide better
information about the behavior of the responses throughout the space. This opinion was
bolstered by training a model on points from the hypercube data set and part of
the  random  data  set.  This  ad  hoc  model  was  compared  against  a  model  trained  only  on 
hypercube  points  by  applying  both  models  to  the  unused  random  points.  The  ad  hoc  model, 
with  a  larger  training  pool  of  points  distributed  more  or  less  evenly  throughout  the  space, 
demonstrated  better  accuracy  than  the  model  with  the  smaller  training  pool. 

4.3.3  Preliminary  Conclusions 

The  initial  experiments  aimed  to  identify  the  sampling  and  modeling  approaches  that  produced 
the  most  accurate  surrogate  models  with  the  fewest  samples.  Based  on  the  observed  results, 
neural  network  models  trained  with  space-filling  samples,  such  as  those  from  Latin  hypercubes 
or  random  distributions,  were  the  best  choice.  The  models  created  using  2,000  to  3,000  points 
were  still  less  accurate  than  desired,  unfortunately,  which  indicated  that  more  samples  would  be 
required  to  create  useful  aerodynamic  surrogates  for  individual  flight  conditions. 

4.4  Main  Effort 

4.4.1  Sampling  Approach 

Recall  that  Latin  hypercubes  are  designed  a  priori,  with  the  user  specifying  the  desired 
number  of  samples  in  advance.  If,  after  the  experiments  are  completed,  it  is  discovered  that 
more data is needed to improve surrogate accuracy, it may be difficult to add more samples while
still preserving the good space-filling qualities of the samples. Qian[152] has addressed this
limitation by proposing a modified Latin hypercube approach known as nested Latin hypercube
(NLHC)  design. 

NLHCs are generated by specifying the smallest hypercube size desired, a growth factor, and
the number of nested levels to be included. Each level serves as a space-filling hypercube. To
illustrate the concept, consider a minimum hypercube of 5 points, a growth factor of 2, and 3
levels. The resulting NLHC will have 20 points, and all 20 points form a Latin hypercube.
This is depicted in Figure 2c. Unlike most hypercubes, however, the first 5 points will also be
a valid hypercube (Figure 2a), and the first 10 points will form a third hypercube (Figure 2b).
This is a very useful quality if the user wishes to sample the space with a Latin hypercube but
there is uncertainty as to the number of cases that should be analyzed. Due to the infill
approach used, the levels of nested Latin hypercubes have geometric growth rates (e.g., the
next-larger hypercube will be double, triple, etc. the size of the previous hypercube).



Approved  for  public  release;  distribution  unlimited 


[Figure 2 panels (a), (b), and (c): sample points plotted on two axes, each spanning 0 to 1]

Figure  2:  A  Nested  Latin  hypercube  with  3  levels:  (a)  5,  (b)  10  and  (c)  20  points 
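
The nesting property illustrated in Figure 2 can be reproduced with a simple infill construction. The sketch below is one possible way to build such a design and is not necessarily the algorithm proposed by Qian: for each variable, it places one point per bin at the coarsest level, then at each finer level adds one point to every bin that is still empty. The function names and the unit design space are assumptions made for illustration.

```python
import random


def nested_lhc_column(base, growth, levels, rng):
    """One variable of a nested Latin hypercube, in sampling order."""
    sizes = [base * growth**k for k in range(levels)]
    n_fine = sizes[-1]
    ordered_bins = []  # finest-level bin index of each point, in sampling order
    for size in sizes:
        width = n_fine // size  # number of finest-level bins per bin at this level
        used = {b // width for b in ordered_bins}  # bins at this level already filled
        new_bins = [c * width + rng.randrange(width)
                    for c in range(size) if c not in used]
        rng.shuffle(new_bins)
        ordered_bins.extend(new_bins)
    return [(b + rng.random()) / n_fine for b in ordered_bins]


def nested_lhc(base, growth, levels, n_dims, seed=None):
    """Points in [0, 1)^n_dims whose leading subsets are also Latin hypercubes."""
    rng = random.Random(seed)
    columns = [nested_lhc_column(base, growth, levels, rng) for _ in range(n_dims)]
    return list(zip(*columns))


def prefix_is_latin(points, size):
    """True if the first `size` points have one sample per bin in every variable."""
    prefix = points[:size]
    return all(sorted(int(p[d] * size) for p in prefix) == list(range(size))
               for d in range(len(points[0])))


design = nested_lhc(base=5, growth=2, levels=3, n_dims=2, seed=0)
print(len(design), [prefix_is_latin(design, s) for s in (5, 10, 20)])
```

With a base of 5, a growth factor of 2, and 3 levels, the design has 20 points, and the first 5, first 10, and all 20 points each satisfy the Latin hypercube property, matching the example in the text. No space-filling optimization is attempted here.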

To  sample  the  RBS  design  space,  a  NLHC  was  generated.  Surrogate  models  trained  with  3,000 
points  had  shown  moderately  good  performance,  so  it  was  expected  that  not  many  more 
points  would  be  required  for  truly  accurate  surrogates.  Thus,  a  NLHC  was  generated  with  a  base 
hypercube  of 4,000  points  and  larger  hypercubes  of  8,000  and  16,000  points.  Based  on  the  fit 
quality  for  3,000  points,  it  was  expected  that  16,000  points  would  be  far  more  than  sufficient. 


An additional 2,000 point hypercube was generated with the wing trailing edge fixed at the rear
of the fuselage, as this was expected to be a region of the design space where aerodynamic
moments were close to zero. This additional sampling would thus enhance model accuracy in
this  region  of  the  design  space.  An  additional  2,000  random  cases  were  generated  to  test 
surrogate  model  performance  throughout  the  design  space.  Once  all  cases  were  defined, 
geometry  generation  with  PaceLab  began.  A  small  minority  of  cases,  less  than  1  percent,  failed 
to  produce  viable  surface  meshes  as  determined  by  the  Cart3D  meshing  utility.  This  was 
deemed  an  acceptable  degree  of  loss.  Once  built,  cases  could  be  analyzed  with  Cart3D. 


Figure  3:  Cart3D  Lift  and  Pitching  Coefficients  for  LGBB  (from  AIAA  2003-3788) 


4.4.2  Flight  Conditions 

Because the preliminary experiments failed to produce sufficiently accurate surrogates at
either flight condition, the main experiments also treated flight conditions as discrete, rather
than continuous, variables. A separate surrogate model would be created for each response at
each flight condition. Once adequate surrogate performance was demonstrated at each flight
condition, the surrogates would be expanded to interpolate between flight conditions and make
predictions over the entire trajectory space.

The three flight condition parameters - Mach number, angle of attack α, and sideslip angle
β - were discretized to cover the expected trajectory. The flight conditions that were selected
for analysis were based in part on the Cart3D results for the Langley Glide-Back Booster
published by Chaderjian et al.[26] These results demonstrated smooth, regular changes in each
response for Mach numbers above 2; this contrasted with the relatively large variations
observed at slower speeds. These variations with Mach number dwarfed those due to angle of
attack. The lift coefficient and pitching moment coefficient for the Langley Glide-Back
Booster at zero sideslip are displayed in Figure 3.

As a result, Mach number was sampled finely compared to angle of attack or sideslip, with
particular emphasis on speeds below Mach 2. The flight conditions that were analyzed may
be found in Table 1. Note that the flight conditions were combinatorial: each possible
combination of Mach number, angle of attack, and sideslip angle was used. The lone exception
was 15° sideslip, which was only used for Mach numbers 0.3 and 0.5. This produced a set of 48
flight conditions.


Table  1:  Flight  Conditions  for  Main  Experiments 


Parameter                Values
Mach number              0.3, 0.5, 0.8, 0.9, 1.1, 2.5, 4.0
Angle of attack (deg)    0, 15, 40
Sideslip angle (deg)     0, 5, 15

Each configuration was analyzed at every flight condition. As a result, even if only the smallest
level of the nested Latin hypercube (4,000 cases) were analyzed, the full analysis would still
require 4,000 × 48 = 192,000 analyses with Cart3D. Despite the relatively sparse sampling of
flight conditions, this was a 60-fold increase over the 3,000 or so inviscid analyses performed
by Chaderjian et al. when a single configuration was being analyzed. The large increase in
computational effort serves to illustrate the potential expense of design space exploration
when higher-fidelity models must be used.
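
The count of 48 flight conditions and the resulting number of analyses can be reproduced by enumerating the combinations in Table 1 and applying the stated restriction on 15° sideslip. This is a minimal sketch; the variable and function names are illustrative.

```python
from itertools import product

MACH = [0.3, 0.5, 0.8, 0.9, 1.1, 2.5, 4.0]
ALPHA_DEG = [0, 15, 40]
BETA_DEG = [0, 5, 15]


def flight_conditions():
    """All (Mach, alpha, beta) combinations, with 15 deg sideslip kept
    only at Mach 0.3 and 0.5 as described in the text."""
    return [(mach, alpha, beta)
            for mach, alpha, beta in product(MACH, ALPHA_DEG, BETA_DEG)
            if beta != 15 or mach in (0.3, 0.5)]


conditions = flight_conditions()
print(len(conditions))         # 48 flight conditions
print(4000 * len(conditions))  # 192000 Cart3D analyses for the smallest NLHC level
```

The full factorial would contain 7 × 3 × 3 = 63 conditions; dropping 15° sideslip at the five higher Mach numbers removes 15 of them.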

4.4.3  Running  Analyses 

To  support  this  effort,  students  participating  in  this  research  project  were  granted  access  to 
computing  resources  at  Department  of  Defense  High  Performance  Computing  Centers  (HPCCs). 
These resources could complete a single analysis in thirty to sixty minutes, and could process
hundreds  or  thousands  of  cases  simultaneously. 

These modeling efforts were terminated according to schedule constraints in order to leave
sufficient time to create and test the surrogates and document the results before the end of the
contract. The first 8,000 cases of the nested Latin hypercube were completed for each flight
condition, as well as 4,000 additional cases for testing predictive accuracy of the surrogate
models. The testing cases varied between flight conditions depending on which cases were
available. When modeling ended, nearly 579,000 cases had been submitted for analysis. Some
of those cases failed during analysis, or did not converge sufficiently by the end of simulation.
After filtering out those unusable results, slightly more than 559,000 cases remained. Every
flight condition had at least 10,000 cases, with an average of 11,500 cases per flight condition.
This effort took roughly two months and consumed over two and a half million processor-hours
of HPCC resources. If a contemporary quad-core desktop had been used for the
analysis, it would have taken more than 73 years to complete.
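
The desktop comparison is a straightforward unit conversion, sketched below. The calculation assumes that the desktop runs continuously, that the work scales ideally across its four cores, and that each desktop core is equivalent to one HPCC processor; none of these assumptions is stated in the text, and the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def desktop_years(processor_hours, cores=4):
    """Wall-clock years for a machine with `cores` cores running nonstop."""
    return processor_hours / cores / HOURS_PER_YEAR


def processor_hours_for(years, cores=4):
    """Processor-hours delivered by `cores` cores running nonstop for `years`."""
    return years * cores * HOURS_PER_YEAR


print(desktop_years(2.5e6))     # a little over 71 years for exactly 2.5 million hours
print(processor_hours_for(73))  # 2557920 processor-hours correspond to 73 years
```

Under these assumptions, 73 years on a quad-core machine corresponds to roughly 2.56 million processor-hours, which is consistent with the stated total of over two and a half million.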

4.4.4  Evaluating  the  Resulting  Surrogates 

It was something of a surprise to discover, in light of the number of analyses performed, that the
resulting surrogate models still had somewhat poor predictive accuracy. The smallest 95 percent
confidence interval for predicting pitching moment coefficient was ±0.26, and the average 95
percent confidence interval was ±0.83. These are very broad confidence intervals, given that
the Langley Glide-Back Booster pitching moment coefficient never exceeded ±0.1 throughout
its entire flight envelope.[26]

A  variety  of  techniques  were  applied  in  an  attempt  to  improve  model  accuracy.  The  scale  of 
the  task  -  6  coefficients  at  each  of  48  flight  conditions,  for  a  total  of  288  responses  -  precluded 
the  development  of  a  custom  approach  for  each  surrogate.  Instead,  one  flight  condition  (Mach 
0.5, α 0°, β 0°) was selected as a test case to identify strategies that might improve all of the
surrogate  models. 

Because  the  ranges  of  the  design  variables  could  produce  aircraft  with  very  large  or  very  small 
wings,  the  reference  areas  by  which  the  forces  and  moments  were  normalized  could  vary 
greatly  across  the  design  space.  Models  were  fit  to  force  and  moment  values  that  were  only 
normalized  by  dynamic  pressure  to  remove  the  effects  of  reference  area,  but  no  improvement 
in  model  accuracy  was  observed.  Transformations  of  the  responses  -  including  squaring, 
square  root,  exponential  and  natural  log  operators  -  were  applied,  without  improvement  in 
accuracy. Outliers, identified by Mahalanobis distance calculations[115] in the statistical
analysis  software  JMP,[91]  were  removed  before  fitting  models,  without  any  increase  in 
accuracy. 

One  technique  was  found  to  be  helpful.  Surrogate  models  of  pitching  moment  coefficient  were 
more  accurate  if  predictions  for  lift  and  drag  coefficients  were  included  in  the  training  data 
for  each  point.  This  did  not  affect  the  accuracy  of  the  lateral  moment  models,  but  the  95 
percent prediction confidence interval for pitching moment at the test flight condition shrank
from ±0.5 to ±0.4. This was considered a sufficient improvement to be worth the
effort, and the approach was used as the model training method for moments at all flight conditions.

As previously stated, the most accurate surrogate model for pitching moment coefficient had a
95 percent prediction confidence interval of ±0.26. This model was at a flight condition of
Mach 2.5, α 0°, β 5°. With a prediction confidence interval that large, even if a certain
configuration  is  predicted  to  have  exactly  zero  net  pitching  moment,  there  is  so  much 
prediction  uncertainty  that  there  is  a  44  percent  chance  that  if  the  configuration  were  analyzed 
with  Cart3D  it  would  be  found  to  have  a  pitching  moment  coefficient  so  large  it  would  be 
uncontrollable.  Pitching  moment  models  for  other  flight  conditions  had  larger  uncertainties, 
which  corresponded  to  lower  confidence  in  predicted  performance. 
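
The 44 percent figure can be checked with a short calculation. The sketch below assumes that the prediction error is normally distributed about the predicted value and that the 95 percent interval corresponds to about two standard deviations; neither assumption is stated explicitly in the text, and the function name is hypothetical.

```python
import math

def exceedance_probability(half_width_95, limit, z=2.0):
    """P(|true value| > limit) when the prediction is exactly zero.

    half_width_95 is the half-width of the 95 percent interval; dividing it
    by z (about two standard deviations) gives the assumed standard deviation
    of a zero-mean normal prediction error.
    """
    sigma = half_width_95 / z
    return math.erfc(limit / (sigma * math.sqrt(2.0)))

# Most accurate pitching moment surrogate: interval of +/-0.26, trim limit of 0.1.
p_best = exceedance_probability(0.26, 0.1)
# Average pitching moment surrogate: interval of +/-0.83.
p_average = exceedance_probability(0.83, 0.1)

print(f"best model: {p_best:.2f}, average model: {p_average:.2f}")
```

Under these assumptions the best model gives a probability of roughly 0.44, in line with the figure quoted above, and the average model fares considerably worse.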

In light of the observed model performances, it was decided to analyze all available
configurations at a single flight condition to estimate the number of cases necessary for some
particular model confidence. Twenty thousand configurations were analyzed at Mach 0.5, α 0°,
β 0°. These configurations included all 16,000 cases in the NLHC, along with the 2,000
random cases and the 2,000 cases with the wing aligned to the rear of the fuselage. Surrogates
were trained using each level of the NLHC, i.e., 4,000, 8,000, and 16,000 cases, and the
predictive accuracy of those surrogates was tested using the non-NLHC cases. The prediction
error distributions of the three surrogates are plotted in Figure 4.

The 95 percent prediction confidence interval for the 4,000 point model was ±0.73. For the
8,000 point model, the same interval was ±0.56, and for 16,000 points the interval was ±0.42.
The residual, which is the discrepancy between surrogate prediction and true response value,
grew smaller as the pool of training data was enlarged. However, it shrank relatively slowly; the
increase in accuracy between the second and third model was less than between the first and
second, despite adding twice as many samples to the training pool. This suggested that a highly
accurate surrogate model would require a very large number of samples.

A very rough estimate was made of the rate of convergence relative to the number of training
points. Linear, exponential, and logarithmic regressions were made of the data points, with
the logarithmic regression matching the most closely (R² = 0.997, which indicated that less
than 1 percent of the variation in the response was not being captured). The results suggested
that as a loose approximation, 60,000 samples would be required at this flight condition to
achieve a 95 percent prediction confidence interval of ±0.15. With an interval of that size, if a
configuration were predicted to have zero net pitching moment, the chance that Cart3D
analysis would reveal a true pitching moment coefficient larger in magnitude than 0.1 was less than 1 in 3.
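
The logarithmic regression can be reproduced from the three reported interval widths. The sketch below assumes the fitted form was width = a + b ln(n), which is not stated explicitly in the text; the function names are hypothetical.

```python
import math

# (training set size, 95 percent interval half-width) for the three surrogates
points = [(4000, 0.73), (8000, 0.56), (16000, 0.42)]

def fit_log_model(data):
    """Least-squares fit of width = a + b * ln(n); returns (a, b, r_squared)."""
    xs = [math.log(n) for n, _ in data]
    ys = [width for _, width in data]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

def samples_for_width(a, b, target_width):
    """Invert the fitted curve to estimate the sample count for a target width."""
    return math.exp((target_width - a) / b)

a, b, r_squared = fit_log_model(points)
n_required = samples_for_width(a, b, 0.15)
print(f"R^2 = {r_squared:.3f}, samples for +/-0.15: about {n_required:,.0f}")
```

With this assumed form the fit recovers an R² close to the reported 0.997 and extrapolates to a sample count in the tens of thousands, the same order as the loose 60,000 estimate. An extrapolation from three points to nearly four times the largest sample size should be read as an order-of-magnitude guide only.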

Figure 4: Prediction Error for Models Trained on (a) 4,000 samples, (b) 8,000 samples, & (c) 16,000 samples

Analyzing that number of cases at each of 48 flight conditions would amount to nearly 2.9
million analyses, i.e., roughly five times the expense invested in the RBS project. Such an
effort, using the same high-performance computing resources, would take a year to complete.
This would require such an investment of time and effort that it might be considered
impractical; on lesser computing systems, it would almost assuredly be infeasible.

The research objective - accurate aerodynamic surrogate models for a large, multi-dimensional
design space and many flight conditions - appeared to be out of reach unless a superior
approach could be developed.

4.5  Research  Questions 

Eggers[50]  and  Chaderjian  et  al.[26]  each  demonstrated  that  Euler-level  aerodynamic  models 
were  moderately  successful  at  capturing  the  performance  of  reusable-booster-type  vehicles.  In 
addition,  Pamadi  et  al.  demonstrated  that  lower-fidelity  analysis  tools,  such  as  APAS,  had 
significant  difficulty  modeling  a  reusable  booster  vehicle,  in  part  due  to  the  nonlinear  behavior 
of pitching moment at α > 15° for that vehicle.[144] This was a critical shortcoming, as a
substantial portion of the RTLS trajectory may take place at such angles.[79, 98, 144]

This  suggested  that  for  vehicles  like  reusable  boosters,  where  nonlinear  effects  are  likely  to 
be  significant,  Euler  aerodynamics  may  be  the  minimum  level  of  fidelity  capable  of 
adequately  capturing  vehicle  performance.  Simpler,  faster  models  exist,  but  are  known  to  be 
deficient  for  conditions  which  may  constitute  a  large  portion  of  the  return  trajectory.  As  a 
result,  using  those  faster  models  to  support  design  decisions  introduces  the  risk  that  the 
configuration(s)  selected  for  further  simulation  will  later  be  found  to  have  deficient 
performance, as occurred during the DLR reusable booster effort described in Section 1.7.[52]
This creates a strong incentive to use higher-fidelity tools early in the design process, before
parameter  values  are  fixed. 

However,  higher-fidelity  modeling  may  be  too  expensive  for  use  early  in  the  design  process.  The 
execution  time  required  to  analyze  each  concept  with  a  highly  accurate  model,  at  minutes  or 
hours  per  configuration,  can  be  prohibitive.  The  example  of  the  RBS  project  illustrates  how, 
although  surrogate  modeling  enables  the  rapid  estimation  of  the  results  of  said  highly  accurate 
model,  the  training  of  useful  surrogate  models  may  also  require  excessively  large  quantities  of 
data.  Thus,  no  matter  how  desirable  it  may  be  to  incorporate  higher-fidelity  tools  in  the  early 
stages  of  a  design  project,  the  current  state  of  the  art  makes  this  unlikely  or  impossible. 

This  leads  to  the  primary  research  question: 

Research  Question  1:  How  can  high-fidelity  modeling  be  feasibly  applied  earlier  in  the  design 
process,  despite  the  computational  expense? 

The  RBS  project  illustrated  some  of  the  critical  technical  challenges  that  must  be  addressed 
before  high-fidelity  modeling  can  be  used  for  design  space  exploration.  The  sampling  of 
flight  conditions  used  by  the  RBS  project  was  relatively  sparse.  Referring  back  to  the 
aerodynamic results for the Langley Glide-Back Booster depicted in Figure 3 on page 30, both
the lift coefficient and the pitching moment coefficient exhibit variations with Mach number
and  angle  of  attack  that  would  not  be  captured  by  the  sampling  used  for  RBS.  This  is 
especially  true  for  the  variations  in  pitching  moment  at  subsonic  speeds. 

Yet  despite  the  coarse  sampling,  the  computational  effort  expended  during  the  RBS  project  was 
extremely large. Finer sampling would drive the required effort up even further. In addition,
the  computational  work  took  place  on  multiple  state-of-the-art  parallel  computing  systems 
simultaneously;  improved  computational  resources  for  faster  processing  are  not  likely  for  most 
near-future design programs. If improved resources are not likely to be available, the cost to
generate  the  data  must  be  reduced. 

4.5.1  Emphasizing  Useful  Regions  of  the  Design  Space 

A  great  deal  of  effort  was  expended  analyzing  configurations  throughout  the  design  space.  If 
the  goal  is  to  reduce  computational  expense,  the  question  might  be  raised:  are  all  results  of  equal 
value?  If  some  results  are  more  useful  than  others,  the  computational  expense  of  the  effort  might 
be  reduced  by  increasing  the  proportion  of  informative,  valuable  samples. 

Although  the  performance  of  every  vehicle  is  different,  rules  of  thumb  can  be  a  useful  way  to 
capture  trends  or  patterns  in  behavior  common  to  many  vehicles.  Carpenter[24]  suggested 
that, for the range of flight conditions relevant to reusable booster systems, configurations with
pitching moment coefficients within ±0.1 should be considered likely to trim given reasonably-sized
control surfaces. Vehicles exhibiting larger moments may have difficulty achieving trim.
When the data from the RBS study was investigated, it was found that a large majority of the
configurations exhibited pitching moment coefficients beyond that range, sometimes by a large
margin. The distribution of pitching moment coefficients at Mach 0.5, α 0°, β 0° (the flight
condition with the largest number of available results) is shown in Figure 5.


Figure 5: Partial Distribution of Pitching Moments at Mach 0.5, α 0°, β 0°

More  than  90  percent  of  the  configurations  tested  experienced  pitching  moments  so  large  that 
they  were  not  likely  to  be  a  feasible  vehicle.  This  suggests  that  much  of  the  design  space  is 
of  little  interest  to  vehicle  designers.  Analyzing  so  many  configurations  with  such  poor 
performance  was  directly  contrary  to  the  goal  of  maximizing  the  amount  of  useful 
information  gained  from  each  experimental  analysis.  Furthermore,  if  it  were  known  that  a 
certain  region  of  the  design  space  had  poor  performance  -  i.e.,  designs  in  that  region  were 
unlikely  to  trim  -  that  region  could  be  sampled  less  intensively,  which  in  turn  could  reduce  the 
overall  modeling  cost. 

Although  a  surrogate  model  that  is  highly  accurate  throughout  the  entire  design  space  is 
intellectually  satisfying,  the  RBS  effort  demonstrated  that  it  may  also  be  unacceptably  costly.  A 
simpler  solution  may  be  available:  surrogate  models  are  more  accurate  for  cases  similar  to 
those  used  to  create  the  surrogate. [92]  By  selectively  placing  samples  in  promising  regions  (i.e., 
those with aerodynamic moments close to zero), models trained to fit the resulting data set
could exhibit improved prediction accuracy in those regions. Regions with poor performance
need only be sampled enough to identify them as uninteresting.

This sampling strategy would attempt to emphasize configurations with good performance.
Surrogate models trained on the resulting data pool would have high accuracy in well-sampled
regions, and lower accuracy in sparsely-sampled regions. If the sampling strategy is successful,
the result would be a surrogate model that has good prediction accuracy with regard to regions
where the moments are close to zero, and sufficient prediction accuracy in the other regions of
the design space that those regions could be identified as unattractive. This falls short of a
surrogate that is very accurate across the entire design space, but it does promise a surrogate
that is very accurate in regions of interest to the designer.

The field of adaptive sampling, which chooses the next experiment based on previous results, is
well-established, and an overview will be given in the next chapter. Most of the literature
focuses on maximizing or minimizing some response value, however. This may be of little use
for an effort such as this, when the goal is to find the regions of the design space where the
response or responses take values within certain ranges. This leads to the next research question:

Research Question 2: When good performance refers to responses within desirable ranges
rather than maxima or minima, how can regions of good performance be identified and
emphasized during the sampling process?

Such a sampling process would minimize the number of expensive analyses that must be
performed before a surrogate of useful accuracy could be trained, by restricting the regions of
useful accuracy to only those likely to contain feasible designs. However, the reduction in
modeling expense that would be necessary to bring high-fidelity surrogates into the realm of
feasibility is likely to be of an order of magnitude or more. It may be overly optimistic to
expect that an improved sampling plan alone would be sufficient.

A large factor in the overall cost is the per-analysis cost of Cart3D. If this cost could be brought
down, or the dependence on Cart3D analyses mitigated, the total modeling cost could be
reduced by a significant margin.

4.5.2  Reducing  Dependence  on  Expensive  Models 

Pamadi et al. compared the aerodynamic accuracy of linear aerodynamics, Euler CFD, and
wind tunnel data for the Langley Glide-Back Booster.[144] Figure 6 displays this comparison
near Mach 0.3. Similar results for Mach 1.2 and 4.5 may be found in the source document.
Although the linear aerodynamics predictions diverged from the other data for α > 10° or so,
the results at smaller α all agreed well. The running time of APAS, the low-fidelity tool used
by Pamadi et al., is less than one second on a contemporary quad-core desktop computer. That
is orders of magnitude faster than the hour or so to complete a Cart3D analysis on the same
number of processors. While APAS may not be sufficiently accurate at high α to act as the
only source of aerodynamic data over the flight envelope, the significant cost reduction and
relative accuracy of APAS make it attractive as a source of data if the results of different levels
of fidelity can be combined. This raises the next research question:

Research Question 3: How can cheaper analyses be integrated with high-fidelity models to
reduce the overall cost of design space exploration or exploitation?


Figure 6: Comparison of APAS, Cart3D, and Wind Tunnel Results for LGBB near Mach 0.3
(Legend: LGBB Aero R3.0 Baseline, M = 0.2; 16 FT, M = 0.3; MSFC, M = 0.3; Cart3D, M = 0.3. Horizontal axis: α, deg.)

If less-expensive sources of data can be incorporated, cheaper APAS samples can be used to
reduce the costs of data generation. Those samples can form the bulk of the training data set.

A set of samples will also be analyzed with Cart3D to evaluate the agreement between the two
sets of results. At best, the APAS results will agree closely with the Cart3D results, and few
Cart3D analyses will be required. It seems prudent to expect that this will not always be the case.

Remember, however, that discrepancies between the two models are unlikely to be random;
instead, for the most part the discrepancies are likely to result from the phenomena captured
(or neglected) by each model. As a result, it seems plausible that patterns in the discrepancies
could also be captured by surrogate models, allowing the correction of results obtained via APAS
to values similar to those that would be obtained via Cart3D.
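
The idea of modeling the discrepancy can be sketched in a few lines. In the example below, `cheap_model` and `expensive_model` are hypothetical analytic stand-ins, not APAS or Cart3D, and their coefficients are invented for illustration. The quadratic form chosen for the discrepancy surrogate is likewise an assumption, not a method taken from this report.

```python
import numpy as np

def cheap_model(alpha_deg):
    """Hypothetical stand-in for a fast, linear tool (APAS-like)."""
    return 0.05 * alpha_deg

def expensive_model(alpha_deg):
    """Hypothetical stand-in for a slow, nonlinear tool (Cart3D-like)."""
    return 0.05 * alpha_deg - 0.0008 * alpha_deg ** 2

# Only a handful of points are analyzed with both tools.
alpha_paired = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
discrepancy = expensive_model(alpha_paired) - cheap_model(alpha_paired)

# Surrogate of the discrepancy: a quadratic in alpha (an assumed form).
coeffs = np.polyfit(alpha_paired, discrepancy, deg=2)

def corrected_model(alpha_deg):
    """Cheap result plus the predicted discrepancy."""
    return cheap_model(alpha_deg) + np.polyval(coeffs, alpha_deg)

# Compare against the expensive stand-in at a point not used for fitting.
alpha_test = 25.0
error_before = abs(expensive_model(alpha_test) - cheap_model(alpha_test))
error_after = abs(expensive_model(alpha_test) - corrected_model(alpha_test))
print(f"error before correction: {error_before:.3f}, after: {error_after:.2e}")
```

The correction is essentially exact here only because the invented discrepancy is itself a quadratic. With real tools the discrepancy surrogate would be imperfect, and the number of paired analyses needed would depend on how structured the discrepancy turns out to be.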

In the worst-case scenario, APAS results will offer no insight into the behavior of the more
expensive Cart3D results. In this case, the computational burden would be almost unchanged
from what would be expected without the use of multiple analysis tools: the cost per execution
of APAS is orders of magnitude smaller than that of Cart3D, so hundreds or thousands of APAS
results could be obtained for the amount of effort required to analyze a single extra case with
Cart3D.

4.5.3 Uncertain Data

The final observation made from the RBS project results was the poor quality of the surrogates
for lateral responses. One of the measures of accuracy for surrogate models is R², which
quantifies the amount of response variation that is captured by the surrogate; a perfect
surrogate would have an R² value of 1.0.[134] Many of the surrogates created in the RBS
project had R² values above 0.95. The average R² value for the α 40°, β 0° rolling moment
surrogates, however, was 0.25, which is extremely low. This suggests that the surrogates are not able to
capture very much of the observed behavior of those responses.

Upon closer inspection, the solution history for some Cart3D cases was found to exhibit an
oscillatory behavior, even after the solution had essentially converged. This oscillatory behavior
is illustrated in Figure 7. This figure plots the aerodynamic moments acting on a notional
business jet model, which was included as a test problem with Cart3D. For this demonstration the
angle  of  attack  was  increased  over  that  of  the  test  problem,  from  3°  to  40°,  and  the  Mach 
number  was  increased  from  0.84  to  0.9.  This  produces  a  large  region  of  separated  flow  over  the 
wing  similar  to  what  was  observed  in  some  sample  booster  configurations  during  the  RBS 
study.  This  is  not  likely  to  be  an  important  flight  condition  for  a  business  jet.  Instead,  this 
example mimics an experiment performed late in the RBS study.

Figure 7: Oscillatory Solution Behavior


A  cursory  review  of  the  time  history  of  the  three  moments,  depicted  in  Figure  7a,  shows  that  they 
have  essentially  converged  long  before  the  simulation  is  terminated.  Closer  inspection, 
however,  reveals  small  variations  or  jitters  -  the  solution  becomes  mildly  oscillatory  near 
iteration  300.  This  is  of  particular  interest  since  the  response  values  presented  as  output  by 
Cart3D  are  simply  the  last  values  calculated  for  those  responses;  if  the  solution  is  exhibiting 
noisy  behavior  rather  than  being  perfectly  converged,  that  noise  will  be  present  in  the  response 
values  that  are  reported.  For  the  pitching  moment  coefficient  (shown  in  Figure  7b)  the 
magnitude  of  the  oscillations  is  roughly  10  percent  of  the  average  value  of  the  response.  For 
yawing moment coefficient (Figure 7c) the variations are of the same order of magnitude as
the average response value, or larger. Rolling moment exhibited similar behavior to yaw and is omitted
for visual clarity. The two lateral responses can be considered comparatively noisy, resulting in
poor surrogate model accuracy.

An investigation identified this oscillatory behavior as the likely culprit for the poor
accuracy of the lateral surrogate models. The oscillations appeared to stem from the large-scale
flow separation that most or all vehicles experienced at high angles of attack and/or transonic
conditions.

There  was  some  concern  that  the  oscillatory  solver  behavior  was  being  caused  by  some  aspect 
of  the  custom  PaceLab  tool  that  was  used  to  create  the  surface  meshes  of  the  configurations 
being  analyzed.  To  determine  whether  this  was  the  case,  an  example  problem  included  with 
Cart3D - the business jet model mentioned earlier - was analyzed at one of the conditions with
poor  surrogate  behavior.  The  oscillatory  solution  was  again  observed,  suggesting  that  the 
oscillations  stem  from  the  juxtaposition  of  an  inviscid  flow  solver  and  flow  conditions  that  are 
strongly  dependent  on  viscous  effects. 

When  the  reported  response  is  affected  by  some  characteristic  of  the  model  used,  the  effect  can 
take  two  forms:  a  bias  effect  or  a  random  effect.  Bias-type  effects  often  result  from  the 
simplifying  assumptions  in  a  model,  and  may  have  a  consistent  sign  or  even  a  predictable 
magnitude.  Many  multi-fidelity  modeling  techniques  take  advantage  of  such  predictability  to 
“correct”  the  responses  from  a  simpler  model  to  match  the  results  from  a  more  complex  model. 
For example, by neglecting viscous effects, Cart3D does not capture drag due to skin friction,
and so its estimated drag coefficients are likely to be less than what would be measured in flight
testing. If designers are aware of this characteristic, however, corrections can be made: by
adding a constant C_D0 to inviscid Cart3D results, Aftosmis et al. were able to closely match
much of the drag polar computed by viscous models for an RAE 2822 wing.[7] Less-accurate
results  were  obtained  when  the  same  approach  was  applied  to  the  DLR-F4  wing-body 
configuration.  Scharl  and  Mavris  observed  similar  results. [172] 

Effects  that  are  more  random  in  character,  on  the  other  hand,  are  harder  to  account  for.  This  is 
particularly  true  for  models  that  must  iterate  to  reach  a  solution,  as  the  final  result  may  have 
some  degree  of  oscillatory  behavior.  When  the  responses  are  of  different  orders  of  magnitude, 
oscillatory  behavior  that  is  negligible  for  one  response  may  significantly  affect  another.  If 
those  oscillations  are  large  enough  relative  to  the  magnitude  of  the  response,  the  reported 
response  value  could  be  overshadowed  by  the  noisy  effects  of  the  oscillations.  A  surrogate 
model  which  treats  those  cases  as  deterministic  -  that  is,  as  a  precise  representation  of  the 
response  value  for  that  case  -  may  attempt  to  reproduce  the  noisy  component  as  well,  with 
negative  consequences  for  its  predictive  accuracy.  It  is  believed  that  this  is  what  occurred  for 
some  surrogates  for  lateral  responses  during  the  RBS  project. 

Artificial  neural  networking,  the  surrogate  modeling  approach  used  in  the  RBS  project, 
approximates  the  data  rather  than  exactly  interpolating  it.  Small-scale  variations,  such  as  the 
iteration  noise  described  above,  might  not  affect  the  resulting  model  if  the  magnitude  is  small. 
The poor R² values observed in the RBS project indicate that for many lateral responses, the
variations were large enough to have a negative effect on the predictive accuracy of the models.
Rather than treating all data as deterministic and precise, it may be possible to estimate the
amount of uncertainty present in the data. For example, in addition to the force and moment
summaries produced by Cart3D after an analysis, the iteration histories of those responses are
recorded in a separate file. Using these histories, the amount of uncertainty in the responses
can be calculated. However, even if such information were available, it could not be
incorporated into the neural network model, leading to the final research question:

Research  Question  4:  How  can  information  about  uncertainty  in  the  data  be  captured  and 
incorporated  effectively? 

The scripts used to run Cart3D can parse the force and moment iteration histories and report the
average  and  standard  deviation  values  of  each.  This  information  is  already  used  to  ensure  that 
any  case  which  did  not  adequately  converge  could  be  discarded.  The  information  is  then 
retained  but  not  used  during  the  modeling,  as  the  neural  network  tools  that  were  used  could  not 
incorporate  the  uncertainty  data.  Modeling  techniques  which  can  incorporate  uncertainty  data 
would  have  a  ready  source  of  noise  estimates. 
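
A noise estimate of this kind can be sketched as follows. The actual scripts and the layout of the Cart3D history file are not reproduced in this report, so the example works on a plain list of response values, one per iteration. The function name, the choice to average over the final 20 percent of iterations, and the synthetic history are all assumptions.

```python
import math
import statistics

def tail_statistics(history, tail_fraction=0.2):
    """Mean and sample standard deviation over the last part of an iteration history."""
    if len(history) < 2:
        raise ValueError("need at least two iterations")
    count = max(2, int(len(history) * tail_fraction))
    tail = history[-count:]
    return statistics.fmean(tail), statistics.stdev(tail)

# Synthetic history: converges toward 0.02, then oscillates with amplitude 0.002
# (about 10 percent of the converged value).
history = [
    0.02 * (1.0 - math.exp(-i / 50.0)) + 0.002 * math.sin(i / 3.0)
    for i in range(600)
]

mean, std = tail_statistics(history)
last_value = history[-1]
relative_noise = std / abs(mean)
print(f"last value {last_value:.4f}, tail mean {mean:.4f}, tail std {std:.4f}")
```

The tail mean is a steadier estimate of the response than the last computed value, and the tail standard deviation is the per-case noise estimate that a suitable modeling technique could use.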

4.6  Review 

The  RBS  research  project  undertaken  by  the  AFRL  and  ASDL  illustrated  one  of  the  major 
reasons  why  extensive  design  space  exploration  is  seldom  undertaken  using  expensive  models: 
the  computational  effort  required  to  explore  the  space  is  enormous.  While  the  project  was  not 
successful at creating true aerodynamic surrogate models, it did highlight aspects of
the problem that were particularly troublesome. If these difficult aspects could be addressed,
the cost of such an effort might be reduced sufficiently to make the effort feasible. A set of
research questions has been formulated to guide the efforts to address those difficulties. These
research questions directed the literature search, which is presented in the next section.

5 Reducing the Cost of Information

The previous chapter demonstrated how design efforts could benefit if the cost of information
could  be  reduced.  A  variety  of  approaches  have  been  proposed  toward  this  end.  The  simplest 
and  most  obvious  is  the  increase  in  computing  power  that  occurs  over  time.  The  tools  that  had 
been  cutting  edge  are  becoming  cheaper  to  use.  APAS,  an  analysis  tool  with  over  30  years’ 
pedigree,  took  45  seconds  per  analysis  on  a  mainframe  in  1981, [45]  while  today  a  similar 
analysis  can  be  performed  on  a  common  desktop  computer  in  less  than  one  hundredth  of  that  time. 
If  a  desired  trade  study  is  too  expensive  to  be  performed  with  the  given  resources,  the  simplest 
response  would  be  simply  to  wait.  After  enough  time,  the  infeasible  becomes  the  feasible,  and 
the  feasible  becomes  the  commonplace. 

While  true,  this  is  more  of  a  trivial  solution  than  an  effective  strategy.  Programs  have  deadlines 
and  opportunities  can  be  missed.  What  is  needed  is  an  efficient  approach  to  obtain 
trustworthy  information,  particularly  when  many  design  variables  must  be  considered.  Such  an 
approach  could  be  particularly  valuable  if  it  were  general  and  not  bound  to  a  particular  tool. 
Once  developed,  the  approach  could  be  used  to  investigate  the  design  space  and  identify 
important  trends  and  interactions  in  the  response  behaviors.  That  knowledge  could  then  be  used 
in  support  of  future  project  decisions,  such  as  selecting  the  appropriate  parameters  and  ranges 
for future trade studies.

There  are  two  predominant  families  of  techniques  for  gaining  information  about  a  problem  when 
the cost per function evaluation is not negligible: optimization and surrogate modeling. Both
have  long  pedigrees  in  engineering  analysis,  and  each  has  strengths  and  weaknesses  that  make  it 
more  or  less  appropriate  for  certain  applications. 

5.1 Summary of Optimization and Surrogate Modeling

Optimization  is  the  process  of  identifying  the  best  values  for  input  parameters.  The 
quantification  of  goodness  for  a  set  of  input  values  is  made  using  some  objective  function, 
which  is  often  a  weighted  combination  of  the  analysis  outputs.  The  process  may  be  considered 
a  directed  search  of  the  design  space;  points  with  poor  objective  scores  are  considered  to  be 
uninteresting,  and  knowledge  about  the  entire  space  is  only  considered  useful  to  the  extent  that  it 
helps  to  identify  cases  with  better  scores.  This  is  an  iterative  process,  relying  on  results  from 
previous  analyses  to  select  the  next  point  for  evaluation.  Techniques  for  selecting  the  next  case 
may  be  as  simple  as  random  selection,  or  may  incorporate  detailed  information  such  as  the  local 
gradient  of  the  objective  function. 

Gradient-based  optimization  uses  information  about  how  the  optimization  objective  varies  with 
small  changes  in  the  input  settings.  This  information  identifies  the  direction  of  maximum 
improvement  in  the  objective  function,  and  can  be  used  to  very  quickly  find  an  optimum 
solution.  An  optimum  is  a  point  where  any  small  change  in  the  inputs  would  negatively 
affect  the  objective  score.  Such  an  optimization  scheme  can  be  rapid  and  efficient  if  the 
calculation  of  local  gradient  is  not  expensive. [72]  Complex  objective  functions  may  have  more 
than  one  local  optimum,  though,  and  gradient-based  optimization  risks  becoming  trapped  at  a 
point  which  is  better  than  those  around  it  but  not  the  best  point  possible  -  a  local  optimum  rather 
than  the  global  optimum. 
This risk may be addressed through the use of repeated optimizations starting from different
points throughout the space, or even stochastic methods like genetic algorithms, but there is
no guarantee that those techniques will identify the global optimum. Optimizations may still
be constrained by resources and running time, and may be too expensive for an extensive
design space search.
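
The local-optimum risk and the multi-start remedy can be illustrated with a minimal sketch. The one-dimensional objective below is hypothetical, chosen only because it has two minima of different depth; the step size and starting points are likewise arbitrary.

```python
def objective(x):
    """Hypothetical objective with two minima (lower is better)."""
    return x ** 4 - 3.0 * x ** 2 + x

def gradient(x):
    """Analytic derivative of the objective."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def gradient_descent(start, step=0.01, iterations=2000):
    """Plain fixed-step gradient descent from a single starting point."""
    x = start
    for _ in range(iterations):
        x -= step * gradient(x)
    return x

# Multi-start: repeat the local search from several points and keep the best.
starts = [-2.0, -0.5, 0.5, 2.0]
results = [gradient_descent(s) for s in starts]
best = min(results, key=objective)

for s, r in zip(starts, results):
    print(f"start {s:+.1f} -> x = {r:+.4f}, objective = {objective(r):+.4f}")
```

Searches started to the right of the central hump settle in the shallower minimum, while those started to the left find the deeper one. Each extra start costs a full local optimization, which is why the remedy becomes expensive when every function evaluation is itself costly.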

No matter what optimization technique is used, the results will depend on the choice of objective
function. Crucially, the results of an optimization cannot tell the investigator what the
optimum might be for a different objective. Should another response be included or the
relative weights of objectives be changed, the point selected by the previous optimization is
unlikely to be the optimum for the new objective function. The optimization process will have
to be repeated every time the objective is altered, making it difficult to perform “what if” analyses
unless the optimizations are both rapid and inexpensive. For this reason, optimization methods
are seldom used for design space exploration unless the objective is known and unlikely to
change, such as maximizing lift-to-drag ratio.

The  other  primary  approach  to  acquiring  and  leveraging  knowledge  about  a  problem  through 
experimentation  is  through  the  use  of  surrogate  models.  Surrogate  modeling  methods  are  a 
collection  of  techniques  for  interpolating  or  curve-fitting  experimental  results  in  order  to 
estimate  the  value  of  responses  at  other,  unsampled  points. [156]  Surrogate  modeling  also  has  a 
long  history  of  use  in  engineering  analyses,  with  an  increase  in  the  complexity  of  techniques 
used  over  time.  A  major  strength  of  surrogate  models  is  the  execution  time:  once  a  model  has 
been developed, predictions can be made extremely rapidly. This can lead to large savings when
the  original  model  requires  substantial  time  to  complete  an  analysis. 

Many surrogate modeling techniques place greater emphasis on reproducing the entire range of
the response than on regions deemed to be good. This allows the model to be applied with equal
confidence  to  any  point  in  the  design  space.  As  a  result,  the  data  points  used  to  build  the 
surrogate  may  be  selected  in  advance.  Research  into  the  best  methods  for  selecting  which  data 
points  to  run  is  often  collected  under  the  title,  “Design  of  Experiments”  or  DoE.  A  shortcoming  of 
surrogate  models  is  that  the  amount  of  data  required  to  train  them  may  grow  large,  especially  if 
the  response  behavior  is  complex  or  there  are  many  input  parameters.  [72]  Many  experimental 
design  techniques  have  been  developed  in  the  quest  to  gain  the  most  knowledge  about  the 
response  for  the  least  cost. 

Note  that  the  two  approaches  described  here,  optimization  and  surrogate  modeling,  are  not 
mutually  exclusive  and  have  been  blended  in  various  fashions.  One  popular  technique  is  to 
create a surrogate model of the outputs of an analysis tool and, during an optimization, use that
surrogate  to  estimate  objective  function  values  instead  of  using  the  analysis  tool  itself.  This  can 
significantly  reduce  the  computational  expense  of  each  function  call,  particularly  when 
evaluation-heavy  optimization  techniques  such  as  genetic  algorithms  or  gradient  evaluation 
through  finite  differencing  are  used. 

Other  researchers  have  used  optimization  techniques  to  select  the  sample  that  would  be  most 
informative  for  training  a  surrogate  model.  The  process  of  selecting  the  next  sample  based  on 
prior  results  is  known  as  adaptive  sampling.  The  optimization  process  may  aim  to  identify  the 
points with the maximum prediction uncertainty, points that produce a maximum or minimum
response value, or points that produce a particular response value. This topic will be revisited
in greater detail in Section 5.7.1.

Despite  all  the  work  that  has  been  done  in  these  fields,  designers  still  encounter  problems  -  like 
the  RBS  study  -  which  are  constrained  by  computational  costs.  Problems  involving  complex 
features,  such  as  transonic  flow,  typically  require  high-fidelity  modeling  to  accurately  capture 
response  behavior.  Those  high-fidelity  models  may  be  very  computationally  expensive.  Secondly, 
if  a  vehicle  must  operate  at  a  wide  variety  of  flight  conditions,  it  may  not  be  possible  to 
eliminate  many  design  variables  through  screening  tests,  as  was  the  case  in  the  RBS  study;  the 
variables  that  are  unimportant  for  one  response  or  flight  condition  may  strongly  affect  another, 
resulting  in  a  large  number  of  dimensions  which  must  be  considered.  Problems  with  many 
dimensions  are  typically  expensive  to  address  through  surrogate  modeling  and  certain 
optimization  techniques.  Finally,  some  investigations  cannot  adequately  be  captured  by  the 
objective  function  of  an  optimizer.  An  investigation  of  how  trailing  edge  sweep  angle  might 
affect  adverse  yaw  behavior,  for  example,  might  be  difficult  to  express  as  a  maximization  or 
minimization  of  some  objective  function. 

Faced  with  challenges  such  as  these,  it  can  be  difficult  for  designers  to  investigate  the  design 
space  in  a  cost-efficient  manner.  Surrogate  models  can  be  very  expensive  to  train,  especially  as 
the  necessary  prediction  accuracy  increases,  and  while  some  optimization  strategies  may  be 
cheaper  to  execute  they  offer  no  indication  of  how  the  result  might  change  if  the  objective 
function  were  altered.  If  designers  address  the  problem  inefficiently,  they  may  expend  their 
resources  without  collecting  enough  information  to  support  future  decisions,  and  the  project 
would  be  exposed  to  the  risk  of  dead  ends,  backtracking  or  even  failure.  This  provides  the 
incentive  to  research  techniques  that  might  address  some  of  the  factors  which  make  these 
problems  so  difficult.  These  techniques  include: 

•  methods  to  combine  data  at  multiple  levels  of  fidelity  to  reduce  the  need  for  high- 
fidelity  modeling  wherever  possible; 

•  adaptive  sampling  to  reduce  sampling  in  uninteresting  regions  of  the  design  space; 

•  techniques  to  estimate  the  uncertainty  of  a  response  to  quantify  confidence  in  each  data 
point;  and, 

•  techniques to incorporate information about uncertainty present in the data when a
surrogate model is trained.

The  next  section  will  address  the  first  of  these  topics,  methods  to  combine  data  from  multiple 
levels  of  analysis  fidelity. 

5.2  Multi-Fidelity  Methods

Multi-fidelity  methods,  also  called  data  fusion  techniques,  are  techniques  which  combine  data 
from  multiple  separate  sources  into  a  unified  whole.  For  the  most  part,  those  data  sources  are 
models  with  different  levels  of  complexity,  and  therefore  with  different  costs  per  analysis. 
Although  it  is  usually  trivial  to  extend  a  data  fusion  method  to  handle  any  number  of 
contributing  models,  this  description  will  assume  that  two  models  will  be  used;  this 
assumption  is  made  for  the  sake  of  descriptive  simplicity.  Additionally,  the  description  will 
assume  that  all  analyses  are  computational  models,  but  it  should  be  noted  that  the  methods  are 
equally effective for experimental data sources such as wind tunnel or flight testing.
Typically, the cheaper data source achieves its cost savings by simplifying or neglecting one or
more phenomena important to the scenario at hand, making it too inaccurate for use alone.
Surrogate models may also replace the cheaper data source, further reducing the per-analysis
cost without affecting accuracy. Conversely, the costlier source captures these phenomena
accurately but the per-experiment expense makes it undesirable or impossible to generate all the
training data required for an accurate surrogate model. By combining the two, an accurate
surrogate may be trained at less expense than if the more accurate source were used alone.

If the execution time of the low-fidelity data source is not trivial, it may be worthwhile to create
a surrogate model of the low-fidelity results. This surrogate model, rather than the original
source, is then used as the low-fidelity data source thanks to its more rapid execution time.

5.2.1  Additive  and  Proportional  Correctors 

The simplest approaches to combining data of different fidelity levels are the bias and
proportional corrections. These approaches are most useful when the high-fidelity analysis is
so expensive it cannot be performed more than a handful of times. The low-fidelity model is
used to explore the entire region of interest and the predictions are recorded. The high-fidelity
analysis is then applied to a few of the same cases and the results compared.

If an additive corrector (also called a bias corrector) is used, the difference between the two is
treated as a bias error in the low-fidelity model, and a high-fidelity approximation can be
obtained by adding this bias estimate to each low-fidelity result. This method is most effective
when the bias is not expected to vary within the design space. For example, many simple
aerodynamic models neglect viscous effects and thus cannot capture skin friction drag; because
skin friction drag is not strongly affected by angle of attack, a more accurate drag polar can be
obtained by calculating the skin friction contribution for one case with a higher-fidelity tool
and adding it to the drag polar produced by the low-fidelity analysis. [7, 172]

An alternative approach is to treat the discrepancy as proportional - that is, to assert that the
low-fidelity result is always some fraction of the high-fidelity result. To correct for this, rather
than being augmented by some constant value, the low-fidelity predictions are multiplied by
some constant value. These two approaches to correcting the less-accurate predictions
encompass a majority of multi-fidelity methods in the literature.
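
Both constant correctors reduce to a few lines. The sketch below is illustrative only: the function names and the drag-coefficient values are hypothetical, and averaging the discrepancy over the paired cases is one simple choice among several.

```python
def additive_corrector(lo_ref, hi_ref):
    """Return a function that adds the mean observed bias to a low-fidelity value."""
    bias = sum(h - l for l, h in zip(lo_ref, hi_ref)) / len(lo_ref)
    return lambda lo: lo + bias

def proportional_corrector(lo_ref, hi_ref):
    """Return a function that scales a low-fidelity value by the mean observed ratio."""
    ratio = sum(h / l for l, h in zip(lo_ref, hi_ref)) / len(lo_ref)
    return lambda lo: lo * ratio

lo_ref = [0.010, 0.020]   # hypothetical low-fidelity drag coefficients
hi_ref = [0.015, 0.025]   # hypothetical high-fidelity results for the same cases

add_fix = additive_corrector(lo_ref, hi_ref)
mul_fix = proportional_corrector(lo_ref, hi_ref)

corrected_add = add_fix(0.030)   # low-fidelity value plus the mean bias
corrected_mul = mul_fix(0.030)   # low-fidelity value times the mean ratio
```

With these numbers the two correctors disagree (about 0.035 versus about 0.041), which illustrates why the choice between them depends on whether the discrepancy is expected to stay constant or to scale with the response.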

Naturally, these corrections are not restricted to constant values. If multiple high-fidelity
analyses are possible, a proportional correction may be calculated. This proportional correction is
a linear function of the input variable(s), and helps to capture errors in the lower-fidelity
analysis that are not constant. This is helpful when the low-fidelity trend is in the correct
direction but of an incorrect magnitude. Higher order corrections, in which the Taylor series
approximation is extended to include additional terms, have also been described;[60] these
corrections may require large quantities of data and thus become very costly.

The complexity of the correction is limited only by the quantity of data available for
comparison. Researchers sometimes create surrogate models of the discrepancies between the
data sources to more accurately estimate the high-fidelity result, treating the low-fidelity
contribution as cheap or trivial. This approach is often cheaper than creating a surrogate model
of the high-fidelity result directly, given that the low-fidelity predictions will capture some or
most of the high-fidelity behavior.
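
In the simplest case, a discrepancy surrogate is a straight line fitted to the differences between the two sources. Everything named below (the two stand-in models, the sample sites, the least-squares fit) is a hypothetical example; in practice the discrepancy model might be a polynomial, a Kriging model, or something else.

```python
def low_fidelity(x):
    # Hypothetical cheap model.
    return x ** 2

def high_fidelity(x):
    # Hypothetical expensive model; differs from the cheap one by a linear trend.
    return x ** 2 + 0.5 + 0.2 * x

def fit_linear_discrepancy(sites):
    """Least-squares line d(x) = a + b*x through the high-minus-low differences."""
    diffs = [high_fidelity(x) - low_fidelity(x) for x in sites]
    n = len(sites)
    x_mean = sum(sites) / n
    d_mean = sum(diffs) / n
    sxx = sum((x - x_mean) ** 2 for x in sites)
    sxy = sum((x - x_mean) * (d - d_mean) for x, d in zip(sites, diffs))
    b = sxy / sxx
    a = d_mean - b * x_mean
    return a, b

# Only three expensive evaluations are used to train the discrepancy model.
a, b = fit_linear_discrepancy([0.0, 1.0, 2.0])

def corrected(x):
    # Cheap prediction plus the modeled discrepancy.
    return low_fidelity(x) + a + b * x
```

Because the low-fidelity model already carries the quadratic behavior, the discrepancy model only has to capture a simple trend, which is the reason this route can need fewer expensive samples than fitting the high-fidelity response directly.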

Examples  of  efforts  incorporating  one  or  both  of  these  approaches  abound,  primarily  focusing 
on two levels of fidelity. [28, 36, 48, 73, 155, 194] Some variations on these methods are of
particular  interest  and  shall  be  described  in  greater  detail. 

Huang et al.[83] create a separate surrogate model for each level of fidelity, capturing the
deviations between the current level and the next-simpler level. These models are fitted
sequentially rather than simultaneously, simplifying the task of parameter estimation. Kennedy
and O’Hagan[94] combine the additive and proportional correctors, building a linear model for
the proportional corrector and fitting a Gaussian Process to the remaining discrepancy for use
as an additive corrector. They refer to this process as Bayesian calibration of computer models.
Note that fitting such surrogate models requires the inversion of covariance matrices which
incorporate all training data points rather than only those at a single level of fidelity. As a result,
this method can become computationally arduous for large data sets.
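
In symbols, a combined additive and proportional corrector of this kind is commonly written in the form below. The notation (y_H, y_L, \rho, \delta, m, k) is chosen for this summary as an assumption and is not taken from the cited papers.

```latex
y_H(x) \;\approx\; \rho \, y_L(x) + \delta(x),
\qquad
\delta(x) \sim \mathcal{GP}\bigl(m(x),\; k(x, x')\bigr)
```

Here \rho plays the role of the proportional corrector applied to the low-fidelity prediction y_L(x), and \delta(x), the Gaussian Process fitted to the remaining discrepancy, plays the role of the additive corrector.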

Qian  and  Wu[154]  and  Xiong  et  al.[194]  expand  on  this  approach,  replacing  the  simple  linear 
model  with  another  Gaussian  Process  which  allows  the  proportional  factor  to  vary  as  a 
function  of  input  settings.  Gano  et  al.[60]  demonstrated  a  hybrid  corrector,  constructed  as  a 
weighted  sum  of  additive  and  proportional  correctors.  The  weighting  function  evaluates 
nearby  data  samples  to  determine  the  utility  of  each  correction  style,  eliminating  the  need  for 
the  user  to  choose  one  or  the  other  a  priori. 

Although  the  additive  or  proportional  correction  methods  are  far  and  away  the  most  common  in 
the  literature,  they  are  not  the  only  methods  possible.  Research  indicates  that,  rather  than 
training  models  to  transform  the  output  of  the  inexpensive  models,  useful  high-fidelity 
surrogates  may  be  developed  via  a  scaling  of  the  inputs  of  the  models.  Robinson  et  al.[167] 
proposed  an  optimization  algorithm  which  utilizes  space  mapping,  a  technique  for  transforming 
the  inputs  of  the  lower-fidelity  model  in  such  a  way  that  the  model  outputs  match  the  higher- 
fidelity  results  to  at  least  the  first  order. 

5.3  Cokriging 

Other  multi-fidelity  methods  exist  as  well.  The  surrogate  modeling  technique  known  as  Kriging 
has been expanded to incorporate data from multiple sources. Regular Kriging calculates the
covariance between each pair of data points, defined with one covariance equation and one vector of
model parameter values. The user may capture the relationships within and between multiple
sets of data by expanding the model to include multiple covariance equations. This family of
similar  techniques  is  known  as  cokriging.  [182]  Due  to  the  extra  covariance  equations,  the 
model  has  more  parameters  which  must  be  estimated,  making  the  fitting  of  a  model  a  more 
complicated  endeavor. 

Additionally,  multi-fidelity  efforts  often  incorporate  a  large  quantity  of  lower-fidelity  data  due  to 
its  relative  cheapness.  The  estimation  of  optimal  values  for  cokriging  model  parameters  requires 
the inversion of the covariance matrix, which carries a computational cost of order O(N^3), where
N is the number of data points. As a result, when a model is trained on a large data set, the
training process may become very memory- and computationally-intensive.[136] This may
become infeasible for data sets larger than a few thousand cases.
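
The practical consequence of cubic scaling can be checked with simple arithmetic. The sketch below assumes a dense covariance matrix stored in 8-byte floating point and an inversion cost proportional to N^3; the function names are illustrative.

```python
def covariance_memory_gib(n_points):
    """Memory for a dense n-by-n matrix of 8-byte floats, in GiB."""
    return n_points * n_points * 8 / 2 ** 30

def relative_inversion_cost(n_points, n_reference):
    """Inversion cost relative to a reference data set size, assuming O(N^3) scaling."""
    return (n_points / n_reference) ** 3

memory_10k = covariance_memory_gib(10_000)
cost_ratio = relative_inversion_cost(10_000, 1_000)
```

Under these assumptions a tenfold increase in the number of data points multiplies the inversion cost by 1000, and a 10,000-point covariance matrix alone occupies roughly 0.75 GiB.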

To  simplify  the  problem  somewhat,  some  researchers[56,  184]  make  the  assumption  that  the 
lower-fidelity  data  is  independent  of  the  higher-fidelity  results.  A  Kriging  model  is  then  trained 
to reproduce the lower-fidelity data. After that, the discrepancy between the higher- and
lower-fidelity responses at each high-fidelity sample point is calculated, and another Kriging
model  is  trained  to  fit  the  discrepancy  behavior.  This  reduces  the  number  of  model  parameters 
being  simultaneously  optimized,  and  blurs  the  distinction  between  cokriging  and  the 
additive/proportional  correctors  described  above.  However,  note  that  even  after  the  model  has 
been  fit,  applying  it  to  predict  the  response  value  or  prediction  variance  at  some  point  would 
require  the  inversion  of  a  covariance  matrix  that  incorporates  all  data  points  at  all  levels  of 
fidelity,  although  if  the  inverted  matrix  is  saved  it  need  not  be  re-inverted  to  make  subsequent 
predictions. This is similar to the method proposed by Kennedy and O’Hagan[94] mentioned
earlier,  and  like  that  method  can  become  computationally  expensive  when  applied  to  large  data 
sets. 

Ghoreyshi et al.[63] use a variant of cokriging to combine low- and high-fidelity data to reduce
the  cost  of  building  an  aerodynamic  database  for  a  new  configuration.  In  this  approach,  the 
Kriging  surrogate  model  is  first  trained  to  reproduce  the  low-fidelity  data.  Next,  that  surrogate 
is  used  to  make  predictions  for  the  response  value  at  each  of  the  sites  of  the  high-fidelity  data. 

The  low-fidelity  predictions  are  then  treated  as  an  extra  input  dimension,  and  a  Kriging  model 
of  the  high-fidelity  data  is  fit  to  the  combined  input  data.  This  form  of  cokriging,  which  will 
hereafter be referred to as “Ghoreyshi cokriging,” does not require cross-covariance terms, and
has computational expense O(N_L^3 + N_H^3) rather than O((N_L + N_H)^3), where N_L and N_H
are the numbers of low- and high-fidelity data points.
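
The augmented-input procedure described above can be sketched as follows. This is a simplified illustration, not the implementation of Ghoreyshi et al.: it assumes a constant-mean Kriging model with a squared-exponential covariance, a fixed length scale shared by all input dimensions, and hypothetical one-dimensional test data. Hyperparameter estimation is omitted, and all class and function names are invented for the example.

```python
import numpy as np

def sq_exp(a, b, length):
    """Squared-exponential covariance between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length ** 2)

class SimpleKriging:
    """Constant-mean Kriging interpolator with a fixed length scale (a simplification)."""

    def __init__(self, x, y, length=0.3, nugget=1e-10):
        self.x, self.length = x, length
        self.mean = y.mean()
        cov = sq_exp(x, x, length) + nugget * np.eye(len(x))
        self.weights = np.linalg.solve(cov, y - self.mean)

    def predict(self, x_new):
        return self.mean + sq_exp(x_new, self.x, self.length) @ self.weights

def fit_augmented_input_model(x_lo, y_lo, x_hi, y_hi):
    # Step 1: fit a surrogate to the low-fidelity data alone.
    lo_model = SimpleKriging(x_lo, y_lo)

    def augment(x):
        # Steps 2-3: the low-fidelity prediction becomes an extra input column.
        return np.hstack([x, lo_model.predict(x)[:, None]])

    # Step 4: fit the high-fidelity data on the augmented inputs.
    hi_model = SimpleKriging(augment(x_hi), y_hi)
    return lambda x: hi_model.predict(augment(x))

# Hypothetical data: several cheap samples, only three expensive ones.
x_lo = np.linspace(0.0, 1.0, 6)[:, None]
y_lo = np.sin(3.0 * x_lo[:, 0])
x_hi = np.array([[0.0], [0.5], [1.0]])
y_hi = 1.2 * np.sin(3.0 * x_hi[:, 0]) + 0.3 * x_hi[:, 0]

predict_hi = fit_augmented_input_model(x_lo, y_lo, x_hi, y_hi)
y_new = predict_hi(np.array([[0.25], [0.75]]))
```

Each fit solves a linear system only as large as its own data set, which is where the saving relative to a single joint covariance matrix comes from.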

Yamazaki  and  Mavriplis[195]  use  an  expanded  version  of  cokriging  to  construct  their  variable- 
fidelity  model,  combining  up  to  three  sources  of  data  simultaneously.  Gradient  information  is 
generated  using  adjoint  calculations  or  automatic  differentiation  of  the  analysis  tool  and  may 
also  be  incorporated  in  the  surrogate  model  through  additional  covariance  terms.  This  gradient 
information  may  be  used  in  a  Taylor  series  approximation  to  estimate  response  values  at  other 
points  close  to  a  known  point;  these  estimated  values  are  treated  as  lower-fidelity  data  because 
the  response  values  are  estimated,  not  known. 

5.4  Data  Harmonization 

Whereas  cokriging  attempts  to  model  responses  that  are  nominally  different  but  correlated, 
Baume  et  al.[14]  propose  an  approach  which  combines  multiple  sources  for  a  single  response 
which  they  call  “data  harmonization.”  Unlike  cokriging,  which  implicitly  assumes  that  the 
responses  being  modeled  are  correlated  but  not  identical,  data  harmonization  aims  to  combine 
disparate  sources  of  information  about  a  single  response.  This  philosophy  lends  itself  more 
easily  to  the  prediction  of  aerodynamic  responses  using  multiple  models  of  different  fidelity. 

The  motivation  for  this  approach  was  the  creation  of  a  unified  model  of  environmental  data 
across national boundaries, using data sets collected by European nations. When gamma ray
dosage was selected as a test data set, clear biases and variations were identifiable between
sensor results from different countries. If these biases and variations were not addressed,
predictions for gamma ray dosage from the resulting models would exhibit low prediction
confidence.

To  account  for  these  factors,  Baume  et  al.  introduce  the  data  harmonization  approach  to 
modeling.  This  approach  is  similar  to  universal  Kriging,  in  which  a  polynomial  mean 
function  is  fit  to  the  data  while  discrepancies  from  this  mean  are  captured  through  a 
covariance  matrix.  In  the  gamma  ray  example,  the  effects  of  altitude  were  subtracted  as  a  known 
factor,  soil  composition  was  treated  as  an  unknown  effect,  and  a  country  code  was  introduced 
as  a  bias  term,  Gp ,  to  capture  variations  between  sets  of  data  provided  by  the  contributing 
nations.  The  data  harmonization  results  demonstrated  better  agreement  across  national 
boundaries,  all  other  parameters  being  equal,  as  well  as  increased  prediction  confidence. 

Data  harmonization  has  some  commonality  with  cokriging,  in  that  multiple  sources  of  data  are 
modeled  using  a  Kriging-based  approach.  Data  harmonization  is  set  apart  from  cokriging  by  its 
introduction of bias variables, both known and unknown. A more mathematically focused
description  will  be  given  in  Section  7.1.5. 

5.5  Summary  of  Multi-Fidelity  Techniques 

In  general,  researchers  have  found  multi-fidelity  methods  to  be  useful  techniques  for  reducing 
computational  cost  when  high-fidelity  predictions  are  desired.  Simpler,  faster  analyses  provide 
overall  trends  and  general  behavior,  while  slower  but  more  accurate  tools  provide  corrections. 
The utility of a multi-fidelity approach will vary somewhat with the problem. The methods are
at their most effective when the lower-fidelity analysis is substantially less expensive than the higher-
fidelity analysis. This effect is enhanced if the simpler analysis can be easily reproduced via
surrogate  models.  The  degree  of  agreement  between  the  two  analyses  is  also  a  factor  -  the 
closer  the  agreement  between  the  two  levels  of  analysis,  the  more  easily  a  corrective  model  can 
be  trained.  Lastly,  the  expense  of  problem  formulation  should  be  considered;  effort  may  be 
required  to  ensure  that  both  analyses  are  applied  to  the  same  problem,  e.g.,  that  vehicle  geometry 
representations  match  as  well  as  possible.  Differences  in  the  input  and  output  data  requirements 
between  analyses  may  require  extra  preparation  effort. 

The  data  harmonization  technique  of  Baume  et  al.[14]  appears  to  lend  itself  directly  to  the  task 
at  hand.  Multiple  computer  models  will  be  applied,  but  each  will  be  estimating  the  same 
response  instead  of  separate  responses  which  are  correlated.  There  is  some  cause  for  concern 
when  applying  data  harmonization  to  problems  with  large  design  spaces;  typically  such 
problems  require  a  large  number  of  training  points  to  fit  accurately,  and  as  a  result  the  matrix 
inversion  required  to  fit  the  model  may  become  exorbitantly  computationally  expensive.  If  this 
proves  to  be  the  case,  research  into  sparse  methods  may  be  of  use. 

As  for  other  multi-fidelity  techniques,  true  cokriging  in  the  style  described  by  geo-statisticians[93] 
is  beyond  the  scope  of  the  current  effort,  as  no  implementation  of  cokriging  could  be  identified 
that  could  incorporate  more  than  three  input  dimensions. [145,  164]  Other  multi-fidelity 
methods  will  also  be  assessed  to  determine  which  of  them  is  the  most  effective  for  the  current 
problem.  These  methods  were  selected  on  the  basis  of  conceptual  simplicity  and  relative  ease 
of  implementation.  Additive  correction,  proportional  correction,  and  Ghoreyshi  cokriging  will 
all be assessed.
Unlike data harmonization, all three of the other techniques treat each source of data
independently. A surrogate model is trained to match the lowest-fidelity data source, and this
model is combined with higher-fidelity data to train a surrogate model for the next data source.
These methods need not handle every data point from every source simultaneously, and as a
result they will be much less vulnerable to the expense that comes with fitting Kriging models
to very large sets of data. Despite this, if the quantity of data in use does grow large, sparse
methods may be of use here as well. Sparse methods were therefore the next line of research to
be conducted.

5.6  Sparse  Methods 

The simplest approach for limiting the computational effort required to fit a model would be
to  select  a  covariance  function  that  decays  to  zero  quickly,  resulting  in  a  very  sparse  covariance 
matrix  if  the  data  set  is  spread  out  throughout  the  design  space.  [128]  This  would  produce  a 
model  that  is  heavily  dependent  on  the  underlying  mean  function,  diverging  from  this  mean  only 
in  the  very  close  neighborhood  of  the  training  data  points.  Such  a  decision  would  be  risky 
without  high  confidence  in  the  choice  of  the  mean  function.  Alternatively,  many  different  mean 
functions  could  be  tested  to  identify  the  one  or  ones  that  best  fit  the  data.  This  testing  process 
would  have  to  be  repeated  for  each  response  being  modeled. 

The  largest  objection  to  this  approach  is  that  the  region  of  influence  given  to  a  training  point 
by  a  covariance  model  is  defined  by  the  model  parameters,  which  are  estimated  from  the  data 
when  the  model  is  fit.  If  training  points  are  found  to  correlate  with  sites  relatively  far  away, 
optimal  model  parameters  will  reflect  this  behavior  and  the  covariance  matrix  will  not  be  sparse. 
Bounds  on  allowable  model  parameter  values  could  be  set  to  enforce  this  approach  to  sparsity, 
but  such  sparsity  might  come  at  the  price  of  surrogate  accuracy. 

A  different  approach  to  minimizing  computational  effort  through  sparse  methods  is  the  “Subset 
of  Data”  (SoD)  approach.  [157]  Rather  than  utilizing  every  sample  that  is  available  for  training, 
the  SoD  approach  trains  a  model  using  a  subset  of  the  available  data.  This  reduces  the  scale 
of  the  problem  by  discarding  information.  Information  loss  can  be  minimized  by  the  intelligent 
selection  of  the  subset.  The  best  subset  of  m  points,  chosen  from  the  N  points  available,  will 
be  that  which  produces  the  most  accurate  surrogate  model.  The  accuracy  of  a  surrogate  is  tested 
by  assessing  its  ability  to  correctly  predict  response  values  for  the  unused  training  samples. 
Selecting  the  best  subset  of  samples  can  itself  become  computationally  intensive.  However,  the 
effort to invert a matrix of dimension m can grow as quickly as O(m^3),[136] and since m < N,
a  moderate  amount  of  effort  may  be  spent  on  subset  selection  before  this  approach  becomes 
more  costly  than  the  brute  force  approach. 
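
A toy version of subset selection makes the trade-off concrete. The sketch below is an illustration under stated assumptions: a piecewise-linear interpolant stands in for the Kriging model, the data are hypothetical, and every possible subset is enumerated, which is feasible only for very small N.

```python
from itertools import combinations

def interpolate(xs, ys, x):
    """Piecewise-linear stand-in for a surrogate; holds the end values outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])

def best_subset(x, y, m):
    """Try every m-point subset; score each by squared error on the unused points."""
    best_error, best_indices = None, None
    for indices in combinations(range(len(x)), m):
        xs = [x[i] for i in indices]
        ys = [y[i] for i in indices]
        unused = [i for i in range(len(x)) if i not in indices]
        error = sum((interpolate(xs, ys, x[i]) - y[i]) ** 2 for i in unused)
        if best_error is None or error < best_error:
            best_error, best_indices = error, indices
    return best_indices, best_error

x_all = [0.0, 1.0, 2.0, 3.0, 4.0]        # inputs must be sorted for this stand-in
y_all = [abs(x - 2.0) for x in x_all]    # hypothetical response with a kink at x = 2
subset, error = best_subset(x_all, y_all, 3)
```

Here the selected subset keeps the two end points and the kink, and the two discarded points are reproduced exactly. The exhaustive loop also shows why selection itself can become expensive: the number of candidate subsets grows combinatorially with N, so practical schemes rely on greedy or randomized selection instead.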

A  third  technique  found  in  the  literature  is  that  of  using  pseudo-inputs.  [179]  This  technique  is 
conceptually  similar  to  the  Subset  of  Data  approach,  in  that  m  training  points  are  used  to  build 
the  model  rather  than  the  full  N  points  of  the  available  data  pool.  The  critical  distinction  is  that 
pseudo-input methods are not restricted to the points in the pool. Instead, the locations of the
control  points  are  considered  to  be  extra  model  parameters  that  must  be  estimated.  If  m  control 
points are desired with each point having dimension d (equal to the number of input parameters
being modeled), the number of parameters to be estimated is increased by m × d. As
Snelson[179] points out, this may result in an intractable problem if the dimensionality of the
problem is large, and in response they suggest projecting the input space into a lower-
dimensional space, asserting that significant dimensional reduction can often be achieved for
real problems.

In  general,  when  creating  a  surrogate  model  of  a  large  data  set,  most  strategies  for  reducing  the 
cost  of  fitting  the  model  emphasize  using  only  the  most  informative  points  (or  pseudo-points). 
A  large  portion  of  the  available  samples  are  therefore  used  only  indirectly  (during  selection  of 
the  best  subset  or  pseudo-set)  or  not  at  all.  If  the  model  used  to  produce  those  data  points  is 
expensive  to  run,  this  may  be  a  very  inefficient  use  of  resources.  In  order  to  make  the  best  use  of 
available  resources,  it  may  be  worthwhile  to  apply  a  more  iterative  sampling  process  that  aims  to 
identify  only  the  most  informative  data  points.  With  respect  to  the  motivating  problem  of 
predicting  RBS  aerodynamics,  if  such  an  iterative  sampling  method  could  identify  points  with 
near-zero  pitching  moments  it  would  dovetail  nicely  with  the  research  objective  of  emphasizing 
cases  with  good  performance. 

5.7  Selection  of  Experiments 

The  second  of  the  four  approaches  from  the  list  on  page  49  was  the  selection  of  samples  to 
emphasize  interesting  regions  of  the  design  space.  Many  approaches  exist  for  choosing 
interesting  cases  to  analyze.  The  “best”  case  will  depend  on  the  user’s  objectives.  If  prior 
knowledge  of  response  behavior  is  not  available,  it  may  be  necessary  to  do  an  initial  round  of 
experiments  to  investigate  the  response  characteristics.  This  is  sometimes  referred  to  as  a  “warm 
start”.  Options  for  selecting  this  warm  start  will  be  reviewed  before  true  adaptive  sampling 
techniques  are  discussed. 

The  process  of  a  priori  experiment  selection,  whether  for  a  warm  start  or  not,  is  typically  known 
as  the  Design  of  Experiments,  or  DoE.  Some  classical  designs,  such  as  Full  Factorial,  sample  the 
design  space  at  regular  intervals  and  explicitly  capture  the  corners  of  the  design  space,  where 
input variables take their minimum or maximum values. [33] Including all corner points in a
training  set  for  surrogate  models  eliminates  the  risk  of  extrapolation,  which  can  improve 
confidence  in  surrogate  predictions.  As  the  size  and  complexity  of  a  problem  increase,  however, 
response  behavior  in  the  interior  of  the  design  space  may  take  on  a  larger  role  in  surrogate 
model  uncertainty,  and  it  may  be  more  efficient  to  perform  experiments  which  fill  the  space 
effectively  rather  than  sampling  the  extremes.  This  was  noted  in  Section  2.3  as  part  of  the 
RBS effort with the Latin hypercube and I-Optimal samples acting as the interior and extreme
samples,  respectively. 

Latin hypercube sampling is a common space-filling DoE. [126] Latin hypercubes uniformly
sample each design variable, distributing points throughout the space. Unlike the Full
Factorial DoE, Latin hypercubes are not guaranteed to sample the corners of the design space,
which may lead to extrapolation. Latin hypercubes with more points will fill the space more
densely,  which  will  better  resolve  the  space  and  reduce  potential  prediction  errors  due  to 
extrapolation  by  the  surrogate  model.  This  sampling  approach  has  the  advantages  of 
accommodating  any  number  of  samples  for  a  given  design  space  and  resolving  response 
behavior  in  the  central  regions  of  the  space,  but  has  the  disadvantage  of  poor  resolution  near  the 
edges  of  the  space. 
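
A basic Latin hypercube can be generated in a few lines. The sketch below is a minimal version on the unit hypercube, assuming no additional optimization of the design (such as maximizing the minimum distance between points); the function name and the fixed seed are choices made for this example.

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    """Return n_samples points in the unit hypercube, one per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One random value inside each of the n_samples equal-width strata.
        column = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(column)  # pair the strata at random across dimensions
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]

points = latin_hypercube(5, 2)
```

Each design variable is covered evenly because every stratum receives exactly one point, yet nothing forces a point into a corner of the space, which is the extrapolation risk noted above. Any number of samples can be requested for a given number of dimensions.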

There  is  a  trade-off  between  the  number  of  “warm  start”  samples  and  the  number  of  adaptive 
samples:  a  larger  warm  start  will  result  in  more  information  about  response  behavior,  which 
leads to more accurate identification of interesting regions for later sampling. On the other
hand, if the experimental budget is finite, a larger warm start means fewer adaptively
selected samples may be evaluated. The user will have more information, but less opportunity
to  apply  it.  Once  some  knowledge  about  the  response  behavior  is  obtained,  adaptive  sampling 
can  begin. 

5.7.1  Overview  of  Adaptive  Sampling 

Adaptive  sampling  is  the  process  of  choosing  new  samples  based  on  prior  observations.  Once 
regions  of  interest  have  been  identified,  these  regions  are  sampled  more  extensively  to 
improve  understanding  of  nearby  response  behavior.  This  process  is  then  repeated  until  the 
response is considered to be understood sufficiently accurately, or until the experimental
budget  is  consumed.  Selection  of  the  next  sample  point  is  based  on  some  algorithm  which 
evaluates  points  and  identifies  the  one  that  is  “most  interesting”  according  to  some  criterion. 
The  choice  of  selection  criterion  is  what  distinguishes  different  adaptive  sampling  approaches. 

Many  adaptive  sampling  approaches  make  use  of  Kriging  models.  Kriging  models,  as  a  subset  of 
Gaussian  Process  models,  allow  the  user  to  not  only  predict  the  response  value  at  any  point,  but 
also  estimate  the  uncertainty  of  that  prediction  without  arduous  calculations.  This  uncertainty 
typically  goes  to  zero  for  deterministic  (i.e.,  noiseless)  training  points  and  increases  for  points 
farther  from  the  training  data. 
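
The behavior of the Kriging prediction variance can be illustrated with a minimal sketch of simple Kriging in one dimension. The zero mean, unit process variance, Gaussian correlation function, and fixed correlation parameter theta are all simplifying assumptions chosen for illustration; a practical Kriging code would estimate these quantities from the data.

```python
import math

def corr(a, b, theta=10.0):
    """Gaussian correlation between two scalar inputs."""
    return math.exp(-theta * (a - b) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def krige(xs, ys, x_new, theta=10.0):
    """Simple Kriging (zero mean, unit process variance) at x_new.

    Returns (prediction, prediction variance).
    """
    R = [[corr(a, b, theta) for b in xs] for a in xs]
    r = [corr(a, x_new, theta) for a in xs]
    w = solve(R, r)  # Kriging weights R^{-1} r
    mean = sum(wi * yi for wi, yi in zip(w, ys))
    var = 1.0 - sum(wi * ri for wi, ri in zip(w, r))
    return mean, max(var, 0.0)  # clip tiny negative round-off
```

Evaluating krige at one of the training inputs returns the training response with a variance of essentially zero, while the variance rises toward the process variance as x_new moves away from all training points.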

If  the  objective  is  to  create  a  surrogate  model  with  the  highest  possible  prediction  confidence,  the 
simplest  approach  would  be  to  place  samples  where  the  prediction  uncertainty  is  largest.  After 
this  point  is  sampled,  the  response  value  there  is  known  for  certain  and  nearby  predictions  can 
be  made  with  more  confidence.  This  is  an  attempt  to  maximize  the  information  gained  per 
experiment.[110] Another approach would be to identify the point that best improves the
average prediction confidence throughout the design space; this average confidence can be
approximated by evaluating how the confidence would change at a large number of test points
throughout the space.[18, 34] Kleijnen and van Beers[97] describe a variation on that metric,
called  Integrated  Mean  Squared  Error,  which  multiplies  the  variance  at  each  test  point  by 
some  weighting  function.  However,  the  weighting  function  is  left  uniform  in  that  work. 

For  some  applications,  it  may  be  worthwhile  to  focus  on  particular  regions  of  the  design  space 
rather  than  improving  global  knowledge.  If  uninteresting  regions  can  be  identified,  they  may  be 
sampled  sparsely  so  that  more  promising  areas  may  be  investigated  more  thoroughly.  Once 
infeasible  regions  are  identified,  later  samples  can  avoid  those  regions. [103]  Alternatively,  if 
the  objective  is  to  optimize  a  response  value,  adaptive  sampling  can  be  used  to  identify  the 
sample  point  that  would  most  improve  knowledge  of  the  response  near  certain  values.  Such 
sampling  approaches  offer  improved  model  accuracy  in  the  attractive  regions  while 
minimizing  the  cost  to  sample  unattractive  regions.  The  question  then  becomes  how  to  identify 
these  attractive  regions. 

5.7.2  Adaptive  Sampling  for  Optimization 

The  most  common  goal  for  localized  sampling  is  the  optimization  of  a  response,  such  as 
minimizing  wing  weight  or  maximizing  lift-to-drag  ratio.  The  simplest  approach  would  be  to 
sample  the  point  which  is  predicted  to  have  the  optimum  response  value,  but  this  may  have 
undesirable  consequences:  if  samples  are  clustered  too  closely  together,  some  surrogate 
modeling techniques such as Kriging may encounter numerical problems. To avoid this, the
adaptive sampling algorithm can be designed to encourage a certain degree of design space
exploration. 

The most popular strategy that combines exploration with exploitation is the Expected
Improvement (EI) function,[173] which estimates the amount by which the response at a given
point is expected to improve on the current best observation. This expectation is calculated for a
candidate point based on the predicted response value and the uncertainty of the prediction at that
point. This prediction uncertainty is typically assumed to be zero at sampled points and to
grow for points farther from samples. As a result, the EI function has an incentive to select points
in poorly sampled regions. Optimization driven by this criterion is sometimes called Efficient
Global Optimization, or EGO.
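
As an illustration, the standard closed form of Expected Improvement for a Gaussian prediction can be sketched as follows. The sketch assumes a minimization problem; the function and argument names (mu for the predicted value, sigma for the prediction standard deviation, f_best for the best observed value) are chosen here for illustration.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Expected improvement over f_best for a minimization problem.

    mu and sigma are the surrogate prediction and its standard
    deviation at the candidate point.
    """
    if sigma <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # First term rewards a good prediction (exploitation),
    # second term rewards a large uncertainty (exploration).
    return (f_best - mu) * cdf + sigma * pdf
```

A candidate with a mediocre prediction can still score well if its uncertainty is large, while a previously sampled point that is no better than the current best scores zero.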

Cox and John[38] propose the Sequential Design for Optimization (SDO) method, which also
uses the prediction and its uncertainty at each point. The algorithm calculates the
prediction and confidence interval at each candidate point, and the candidate with the best
confidence bound is selected as the next sample point. This may be a point with a very
desirable  predicted  value  and  small  uncertainty,  or  a  point  with  a  moderate  predicted  value 
and  large  uncertainty.  Xiong  et  al.[194]  expanded  this  approach,  varying  the  number  of 
standard  deviations  used  for  the  confidence  interval  to  encourage  the  algorithm  to  explore 
regions  with  high  uncertainty  or  to  emphasize  regions  expected  to  have  good  response  behavior. 
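
The confidence-bound selection described above can be sketched in a few lines. This is an illustrative sketch for a minimization problem, not the published SDO implementation; the tuple layout and the parameter k (the number of standard deviations) are assumptions.

```python
def select_by_confidence_bound(candidates, k=2.0):
    """Pick the candidate with the lowest lower confidence bound.

    candidates is a list of (label, mu, sigma) tuples, where mu is the
    predicted response and sigma its standard deviation. A larger k
    favors uncertain regions; a smaller k favors good predictions.
    """
    return min(candidates, key=lambda c: c[1] - k * c[2])[0]
```

For example, with candidates ("a", 1.0, 0.1), ("b", 1.5, 0.5), and ("c", 2.0, 0.2), setting k = 2 selects "b" because of its large uncertainty, while k = 0.5 selects "a" because of its good predicted value. For a maximization problem the upper bound mu + k * sigma would be maximized instead.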

Huang et al.[83] extend the EI function to allow sampling at different fidelity levels. The
function  is  modified  using  three  multiplicative  terms  which  capture  the  reduced  confidence 
associated  with  a  lower-fidelity  result,  the  benefit  of  repeated  samplings  for  noisy  functions, 
and  the  relative  costs  of  sampling  at  each  fidelity  level. 

5.7.3  Adaptive  Sampling  for  Other  Objectives 

Although  optimization  of  some  response  is  the  most  common  application  of  adaptive 
sampling,  algorithms  have  been  developed  which  sequentially  choose  samples  based  on  other 
criteria as well. Farhang-Mehr and Azarm[55] define a characteristic certainty width (CCW)
parameter, which is used to represent the regularity of the response behavior in a local region.
Large values suggest the response changes slowly and simply over a broad region, while small
values indicate large or rapid shifts in the response over a small distance. Large values thus
indicate  regions  which  may  not  require  many  samples  to  capture  response  behavior,  while 
small  values  indicate  regions  where  additional  sampling  may  be  beneficial. 

Mackman and Allen[111] score candidates for sampling according to the predicted local
nonlinearity of the response. Nonlinearity is quantified using the sum of the diagonal entries
of the Hessian matrix, i.e., the Laplacian. To dissuade the algorithm from clustering points too closely together, a
separation  function  was  incorporated  that  grows  with  increasing  distance  from  existing  sample 
locations,  improving  the  scores  of  points  far  from  the  current  training  set  and  reducing  the 
scores  of  points  too  close  to  previous  samples. 
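
The combination of a nonlinearity measure with a separation function can be sketched as follows. This is not the published formulation: the finite-difference Laplacian, the exponential form of the separation function, and the length scale are all assumptions chosen to show the idea.

```python
import math

def laplacian(f, x, h=1e-3):
    """Central-difference estimate of the Laplacian of f at point x."""
    fx = f(x)
    total = 0.0
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        dn = list(x)
        dn[i] -= h
        total += (f(up) - 2.0 * fx + f(dn)) / (h * h)
    return total

def separation(x, samples, length=0.2):
    """Grows from 0 at an existing sample toward 1 far from all samples."""
    d2 = min(sum((a - b) ** 2 for a, b in zip(x, s)) for s in samples)
    return 1.0 - math.exp(-d2 / length ** 2)

def score(f, x, samples):
    """Candidate score: local nonlinearity scaled by separation."""
    return abs(laplacian(f, x)) * separation(x, samples)
```

Here f stands for the surrogate prediction. A candidate that coincides with an existing sample scores zero regardless of how nonlinear the response is there, which discourages clustering.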

It  is  also  possible  that  the  objective  might  not  be  to  identify  points  with  maximum  or  minimum 
response  values,  but  to  find  points  where  the  response  has  a  particular  value.  This  may 
occur in reliability modeling, when the objective is to identify whether or not a case exceeds
specified constraints; model performance is better served by accurately capturing this failure
contour  than  by  seeking  the  maximum  response  value. 

Ranjan  et  al.[159]  describe  a  sampling  criterion  to  best  improve  knowledge  about  the  response 
behavior  near  the  contour  of  interest.  This  criterion  includes  multiple  factors  to  balance  the 
sampling  style  between  sampling  points  predicted  to  fall  near  the  contour  and  sampling  points 
predicted  to  be  farther  from  the  contour  but  with  enough  uncertainty  that  the  contour  may  be 
closer  than  expected.  A  clustering  penalty  term  is  used  to  dissuade  the  algorithm  from  placing 
samples  in  regions  of  low  uncertainty,  i.e.,  close  to  existing  samples. 

Picheny  et  al.[149]  propose  an  alternative  contour  sampling  algorithm  based  on  the  Integrated 
Mean Squared Error approach described by Kleijnen and van Beers.[97] The prediction uncertainty at a
multitude of test points is combined to quantify the global prediction confidence as in Kleijnen
and van Beers, but in this formulation a non-uniform weighting function is applied. This weighting
function  is  a  measure  of  the  likelihood  that  the  response  at  the  current  test  point  is  close  to  the 
contour  of  interest.  Test  points  predicted  to  have  responses  close  to  the  target  will  have  their 
uncertainty  more  heavily  weighted.  The  algorithm  will  then  select  the  candidate  point  which 
most  decreases  the  IMSE  value,  i.e.,  the  candidate  which  most  improves  prediction  confidence 
for  points  near  the  threshold. 
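
A weighted form of the integrated mean squared error can be sketched as below. The weight used here, the Gaussian density of the target value under each test point's prediction, is an assumption that captures the idea of "likelihood of being close to the contour"; the published criterion may differ in its details, and the names are illustrative.

```python
import math

def contour_weight(mu, sigma, target):
    """Gaussian density of the target value under N(mu, sigma^2).

    Requires sigma > 0.
    """
    z = (target - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def weighted_imse(predictions, target):
    """Average of weight * variance over test points.

    predictions is a list of (mu, sigma) pairs, one per test point.
    """
    total = sum(contour_weight(m, s, target) * s * s for m, s in predictions)
    return total / len(predictions)
```

In an adaptive loop, the criterion would be re-evaluated with the prediction uncertainties expected after adding each candidate, and the candidate giving the largest decrease would be selected. Reducing uncertainty at a test point predicted to lie near the target lowers the criterion far more than the same reduction at a point predicted to lie far from it.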

5.7.4  Summary  of  Adaptive  Sampling  Techniques 

Many  sampling  algorithms  have  been  documented  that  promise  to  leverage  current  knowledge 
about  the  response(s)  of  interest  when  selecting  the  next  experiment  to  be  run.  Depending  on 
the  user’s  goals,  sampling  algorithms  exist  to  improve  global  model  confidence,  find  global 
maxima  or  minima,  identify  local  nonlinear  behavior,  or  accurately  capture  response  behavior 
near  some  desirable  threshold. 

Given  the  context  of  aircraft  design  and  the  concept  of  design-for-trim,  the  contour  sampling 
approaches  described  by  Ranjan  et  al.  and  Picheny  et  al.  are  the  most  promising.  Contour 
estimation  may  be  used  to  focus  samples  in  the  region  of  interest,  i.e.,  configurations  that  are 
expected  to  experience  small  moments  at  likely  flight  conditions.  In  addition,  both  selected 
approaches  incorporate  factors  which  dissuade  clustering  of  points.  Clustered  points  are 
informative when the goal is to find an optimum response value, but when the goal is to fit a
useful surrogate model throughout a design space, clustered points are likely redundant and
potentially  a  waste  of  effort. 

Note that the selected sampling techniques, as well as many others, incorporate prediction
uncertainty in the evaluation criterion. This uncertainty is not a function of the accuracy of the
data points, but rather stems from the estimation of model parameters. Each data point in the
training set is considered to be a precise representation of the true response for the appropriate
input values. When applying adaptive sampling to data at multiple levels of fidelity, knowledge
of the likely accuracy of each data point can affect sampling behavior. For example, linear
aerodynamics  tools  are  moderately  trustworthy  at  low  angles  of  attack,  but  neglect  the 
nonlinear  effects  that  become  important  at  higher  angles  of  attack.  Both  low  and  high  angles  of 
attack  would  have  to  be  sampled  repeatedly  to  identify  this  pattern,  as  it  may  not  be  possible  to 
exactly quantify the discrepancy in advance.

Having addressed both adaptive sampling techniques and multi-fidelity methods, two of the
topics of interest remain: quantification of uncertainty inherent in the data, and means of
incorporating that knowledge when creating a surrogate model. These topics will both be
addressed in the next section.

5.8 Quantifying and Incorporating Uncertainty

To  understand  how  uncertainty  can  be  quantified  and  addressed,  the  most  obvious  source  of 
information  is  the  way  that  it  has  been  addressed  in  the  past.  Two  case  studies,  the  Space  Shuttle 
and  the  X-33,  were  used  to  illustrate  how  vehicle  design  programs  evaluated  and  accounted  for 
uncertainty  in  the  data.  These  case  studies  will  provide  an  initial  understanding  of  the  subject. 

5.8.1  Case  Study:  Space  Shuttle 

The  Space  Shuttle  was  the  first  large-scale  effort  to  quantify  and  document  pre-flight 
uncertainty  with  respect  to  vehicle  performance  measures.  The  extensive  public-record 
documentation  of  that  effort  makes  the  Shuttle  program  an  excellent  resource.  Unlike 
previous  programs,  the  Shuttle  did  not  pursue  a  development  program  based  on  a  gradual 
expansion  of  the  vehicle’s  envelope.  Instead,  the  program  leapt  from  low-speed  glide  tests  to  a 
manned  orbital  mission  and  reentry.  This  challenging  program  was  driven  by  incentives  to 
minimize  testing  costs  and  duration.  Instead  of  flight  testing,  understanding  of  the  Shuttle’s 
performance  relied  on  one  of  the  most  extensive  wind  tunnel  testing  regimens  in  history.  [199] 

In  order  to  prepare  for  the  initial  orbital  mission  and  subsequent  reentry,  NASA  needed  to 
quantify the uncertainty in the aerodynamic database for the vehicle. Aerodynamic uncertainty,
coupled  with  the  proposed  vehicle  flight  control  system,  could  be  used  to  identify  conditions 
at  which  the  vehicle  had  minimal  or  negative  control  margin.  When  aerodynamic  uncertainty 
resulted  in  a  risk  of  compromised  mission  performance,  the  design  team  had  the  choice  of 
either  requesting  additional  ground  testing  (to  reduce  prediction  uncertainty)  or  adjusting  the 
reentry trajectory to avoid the troubling flight conditions. Through use of this process, Shuttle
engineers  were  able  to  increase  confidence  in  mission  performance. 

The  Shuttle  aerodynamic  uncertainty  was  quantified  in  a  number  of  ways.  The  first  and  most 
direct  was  the  repetition  of  certain  tests  using  multiple  facilities,  models,  and  sets  of 
instrumentation.  Such  repetition  helped  the  analysts  to  estimate  the  variation  in  responses  that 
was  related  to  the  testing  equipment  and  procedures  rather  than  true  vehicle  behavior.  Those 
variations  were  between  performance  results  at  similar  levels  of  fidelity,  and  were  called 
tolerances.  The  Shuttle  had  to  be  able  to  cope  with  prediction  errors  of  this  scale  without 
affecting  overall  vehicle  performance.  [198] 

In addition to determining tolerances, Shuttle aerodynamicists had to estimate the magnitude
of the discrepancies between wind tunnel predictions and in-flight behavior; those
discrepancies were referred to as variances. Variances would be known with confidence after flight
testing, but early approximations were required to identify risky conditions and plan
accordingly. [198] These approximations were drawn from previous flight test experience
available in the public literature. No single vehicle matched the Shuttle’s configuration or wide
range of flight regimes. Instead, estimates of variances were drawn from the published
pre-flight and post-flight aerodynamic comparisons for a number of vehicles, selected to match
particular aspects of the Shuttle such as a delta wing planform, the use of wing flaps for
longitudinal control, the use of elevons for lateral control, a single vertical tail, and a relatively
large fuselage.[59] No one vehicle satisfied all similarity characteristics.

Vehicles were chosen based on the degree of similarity and the availability of both pre- and
post-flight performance data. These vehicles included commercial aircraft (Concorde), military
service aircraft and prototypes (B-58, XB-70, YF-12), and research vehicles (X-15, X-24B,
M2-F3, HL-10). Although the X-15 was not thought to share many geometric characteristics
with the Orbiter, it was included as a reference due to the paucity of hypersonic data. [190] The
comparisons between the aerodynamic performance estimated before flight and the data
collected during flight testing and operation of the vehicles provided insight into the likely
accuracy of the Shuttle pre-flight database.

Weil[190] uses the M2-F3 lifting body to illustrate the process of estimating variations. At
Mach 1.1, the predicted values of C„^ were compared with flight-measured values over a range
of angles of attack. The
maximum deviation between predicted and flight-measured values was identified after any clear
outliers were removed. A trend in deviation behavior was discernible, growing larger as angle of
attack increased. Weil cautions that “care must be exercised to limit this variation in regions
of rapidly changing characteristics” but does not go into detail about how this might be
accomplished. This process was repeated for other Mach numbers and for other vehicles. The
resulting discrepancies were grouped by Mach number only - the effects of angle of attack or
sideslip were aggregated - and then uncertainty limits were defined based on engineering
judgment, rather than statistical analysis.

Expert opinion played a large role in the development of the variances: it directed the selection
of characteristics that would define useful reference vehicles, it directed decisions as to
whether individual vehicles were sufficiently similar to serve as references, and it was the
foundation for the final variance values in the uncertainty model.

Test flights of the Shuttle demonstrated that even conservative variance estimates did not always
encompass the true vehicle performance. After the first flight test, it was found that
hypersonic trim at high angle of attack required a significantly larger deflection of the body
flap than was planned. Although the discrepancy in pitching moment coefficient was only
approximately 0.03, correcting the discrepancy required 16° of deflection rather than 7°,
leaving less than one-third of the expected control margin for maneuvers or controlling
dispersions. Later testing suggested that real gas effects, which were not extensively simulated
pre-flight, were responsible for the majority of the discrepancy. [86, 133] This supports the
expectation that gaps in the set of reference vehicles - in this case, a lack of lifting
configurations with data at double-digit Mach numbers - may result in an insufficient
understanding of likely variances.

5.8.2  Case  Study:  X-33 

A similar approach to variance estimation was used in the modeling of aerodynamic
uncertainty of the X-33 vehicle, a proposed single-stage-to-orbit demonstrator. [32] The Shuttle
and six lifting-body vehicles were selected as reference vehicles due to similarity of
geometry and/or flight regimes. All selected vehicles also offered sufficient documentation of
the comparisons between pre-flight aerodynamic predictions and flight test data. The X-33
aerodynamic uncertainty model focused on variances rather than tolerances, and effort was taken
to make the model applicable to any lifting body vehicle. Note that the estimated uncertainty is
again a one-dimensional function; the uncertainty associated with other parameters, such as
angle of attack, is aggregated into a quantity that is purely a function of Mach number.

Despite the quantity of data available, expert opinion once again played a large role in the final
X-33 uncertainty model. First, expert opinion drove the process of selecting
vehicles to serve as references. In addition to having publicly documented pre- and post-flight
aerodynamic databases, the chosen vehicles were believed to experience flow behaviors
similar to what the X-33 would encounter, such as nonlinear flow and vortex shedding. Expert
knowledge was required to identify flow behaviors that would be relevant to the configuration,
as well as to select existing vehicles which were subjected to those flow phenomena.

Secondly, decisions were made to incorporate some data points from the reference programs,
modify others, and exclude some entirely. Because the Shuttle had the benefit of an extensive
wind tunnel testing regimen, its pre-flight predictions had less uncertainty than would be
likely for a smaller program. To account for this, parts of the X-33 uncertainty envelope were
enlarged relative to Shuttle variances. Similarly, the X-33 uncertainty envelope was made to
encompass some of the lifting body data points while excluding others, signifying a decision
that the excluded data points were in some way inaccurate or unimportant. Both the choice of
values to be modified and the degree to which they were modified were decisions made by
X-33 aerodynamicists.

5.8.3  General  Approach:  Estimation  of  Uncertainty 

The uncertainty quantifications described in the case studies were based on two considerations.
First, the tolerances were the estimated uncertainty at a particular level of fidelity - based on
CFD results, how accurately can other CFD results be predicted? Secondly, the variances
were the potential discrepancies between different levels of fidelity - based on CFD results,
how accurately can flight performance be predicted?

For this application, the estimation of aerodynamic tolerances would depend on a number of
data sources. First, data that is generated by an iterative solver can be interrogated to
determine solution convergence. For this particular problem, the iteration history from Cart3D
may be used as a measure of the variation in each response. Rather than defining a response as
the value of a certain coefficient after the final iteration, it may be more effective to observe
the mean value over some number of iterations, and use the standard deviation over those
iterations as a measure of the noise for that case. The scripts currently in use to run Cart3D
could be easily modified in this manner. This would provide an individual estimate of
uncertainty for each Cart3D data point.
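
The averaging described above can be sketched as follows. The function name, the default window length, and the representation of the iteration history as a plain list of coefficient values are assumptions for illustration; they do not reflect the actual Cart3D output format or the scripts used in this work.

```python
import statistics

def response_from_history(history, window=100):
    """Summarize a solver iteration history for one coefficient.

    Returns (mean, standard deviation) over the last `window`
    iterations. The mean serves as the response value and the
    standard deviation as a per-case estimate of iteration noise.
    """
    tail = history[-window:]
    if len(tail) < 2:
        raise ValueError("need at least two iterations to estimate noise")
    return statistics.mean(tail), statistics.stdev(tail)
```

A well-converged case yields a small standard deviation, while a case that is still oscillating at the final iteration is flagged by a larger one. The window should exclude the initial transient of the solution.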

Secondly, for the three multi-fidelity techniques which handle each source of data separately,
the low-fidelity data itself will be a source of uncertainty. For this effort, the low-fidelity data
will be produced by APAS. APAS runs fairly rapidly, on the order of 1-2 seconds per case, but
even that could add hours to the time required to select a sample if many options are
considered. Instead, a surrogate model of APAS results will be generated so that thousands of
low-fidelity response values can be estimated in the time it would take to run APAS once.
This additional surrogate model, unless it is a perfect representation, will only approximate the
low-fidelity results. The goodness-of-fit checks which quantify a surrogate model’s performance
include model representation error (MRE), a measure of how well the surrogate predicts response
values for cases that were not part of the training process.[36] MRE can be considered an
estimate of how accurate the surrogate will be when used to predict response values at points
where the true analysis has not been run. It is typically cited as a mean and a standard
deviation of prediction error. The standard deviation can be squared to produce the error
variance, which can then be treated as the prediction variance of the low-fidelity surrogate
model in the absence of more precise estimates. These two sources of data - the iteration noise
from Cart3D analysis and the prediction uncertainty from the surrogate model of APAS results
- shall make up the aerodynamic tolerances for the purposes of these experiments.
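
The conversion from MRE statistics to a prediction variance can be sketched as below. The sketch assumes the error is measured as an absolute difference on held-out validation cases; MRE is also commonly reported as a relative (percent) error, in which case the values would need to be rescaled before being used as a variance. The function name is illustrative.

```python
import statistics

def model_representation_error(predicted, actual):
    """Mean, standard deviation, and variance of surrogate prediction error.

    predicted and actual are responses at validation cases that were
    not used to train the surrogate.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    mean = statistics.mean(errors)
    std = statistics.stdev(errors)
    # The squared standard deviation serves as the prediction variance.
    return mean, std, std ** 2
```

A mean error far from zero would indicate a biased surrogate, which the variance alone does not capture.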

The  process  for  estimating  aerodynamic  variances  -  the  uncertainty  introduced  when  using 
results  at  one  level  of  fidelity  to  estimate  response  behavior  at  some  other  level  of  fidelity  - 
is more complex. For both the Space Shuttle and X-33, expert opinion was used to select
vehicles which could offer useful comparisons. The Shuttle is an especially good example of
the potential difficulties. None of the reference vehicles satisfied all five of the similarity
criteria. Additionally, some reference vehicles only had data available for a limited set of
flight conditions. This was not unexpected; the Shuttle was a significant departure from what
had  been  done  before,  and  thus  a  comprehensive  suite  of  reference  vehicles  was  not  likely  to 
exist. 

Although  the  X-33  uncertainty  database  was  intended  to  be  applied  to  both  wind  tunnel  and 
computational  model  results,  it  did  not  specify  the  type  of  computational  model  that  is 
expected.  This  may  be  considered  a  shortcoming,  as  the  prediction  accuracy  of  computational 
models  will  vary  greatly  with  model  fidelity.  This  type  of  knowledge  would  be  useful  when 
designing  the  experimental  plan;  if  it  is  known  that  a  simple  model  will  have  sufficient 
prediction  accuracy  at  a  particular  flight  condition,  the  designer  will  not  need  to  apply  more 
expensive  models  for  confirmation.  Such  an  approach  would  rest  on  the  ability  to  accurately  and 
dependably  quantify  the  accuracy  of  a  given  model  at  various  flight  conditions. 

Recall again from Section 1.6 the calls for treating model validation as an ongoing process, one
that is repeated and extended as necessary each time the tool is applied to a different
application.[8, 138] If each computational model is validated over the relevant ranges of
flight  conditions,  the  user  will  have  obtained  a  pool  of  data  which  can  be  used  to  estimate  each 
model’s  likely  accuracy  across  the  proposed  vehicle’s  flight  regime.  This  pool  of  data 
quantifies  the  accuracy  for  each  validation  case;  carefully  chosen  validation  cases  will 
therefore  allow  the  estimation  of  variances  for  the  configuration  of  interest. 

Note  that  the  Shuttle  and  X-33  uncertainty  databases  modeled  the  variance  as  a  single  value  per 
Mach  number.  This  reflects  the  assumption  that  aerodynamic  data  would  come  from  wind 
tunnel  testing  and  possibly  computer  modeling  at  an  equivalent  level  of  fidelity.  Early  in  the 
design  process,  aerodynamic  data  may  have  many  sources  due  to  the  variety  of  analysis  tools 
available.  Each  source  may  have  different  prediction  confidence  at  each  flight  condition  of 
interest.  If  each  data  source  is  validated  against  relevant  cases,  the  likely  accuracy  of  each  data 
point may be estimated.

For this particular effort, variances would be less important. The scope of the research did not
offer the opportunity to acquire flight test data for a reusable booster system, so CFD analyses
would be the highest-fidelity data source available. As a result, it was not considered necessary
to estimate how well the CFD results would match physical measurements. Instead, the CFD
results would act as truth data for these experiments. Furthermore, it was already established that
the discrepancies between the lower-fidelity APAS results and the CFD results would be
accounted for as part of the tolerances. Thus, variances would not play a significant role in
this effort, although they are certainly important to take into account when designing a
revolutionary vehicle. Still, variances were investigated to a reasonable degree during the course
of this research.

The uncertainty databases as described were intended for use in low-risk trajectory design and
the testing of flight control software. If such information were available earlier in the design
process, it would reduce the risk of selecting a deficient design by quantifying the confidence of
performance predictions. Knowledge of the prediction confidence at each flight condition
would also increase computational efficiency by focusing the use of expensive simulations on
cases which most benefit from the increased prediction confidence.

5.8.4  Incorporating  Uncertainty 

The final topic to be addressed is the incorporation of the uncertainty information into the
analysis process. Given that the adaptive sampling methods which are best-suited to the
problem at hand are Kriging-based, it would seem reasonable to focus the search on
techniques to incorporate uncertainty into Kriging models. Recall from Section 5.7 that
prediction variance goes to zero at the training points if those points are deterministic.

If the training points are not deterministic, i.e., there is some uncertainty as to the exact value
of the response at each point, a nugget parameter may be used to quantify the response
uncertainty at that point. [93] The nugget parameter is a scalar or vector of values that are
added to the diagonal of the covariance matrix when fitting a Kriging model; the nugget
magnitude controls how closely the Kriging model will reproduce the training data, with
larger nuggets corresponding to looser fits of the data. Nuggets were included in some of the
earliest Kriging formulations,[119] and were intended to capture measurement error and
small-scale variations in the response as part of geostatistical modeling. [93] Even when modeling
the output of deterministic (i.e., noiseless) computational tools, some researchers advocate
the addition of small nugget terms to covariance matrices to improve numerical stability and
reduce the risk of a poorly conditioned covariance matrix.[67, 107, 160, 197]

One consequence of adding a nugget to the covariance matrix is that the resulting Kriging model
will no longer be an exact interpolator: it will not exactly reproduce the training data set
when used to predict the response at a known point.[39] Very small nuggets, on or close to the
order of machine precision, will not affect the interpolation behavior significantly.[107]
Gramacy and Lee also argue that nuggets improve the statistical properties of the emulator in
cases of sparse data or when modeling assumptions such as the correlation type are
incorrect.[68] The larger the nugget values, the more closely the model will follow the
estimated mean function.[182]

Overall, the ability of a vector of nuggets to capture any uncertainty that varies between samples,
as well as the compatibility of nuggets with the Kriging formulations that support many
adaptive sampling approaches in the literature, made the use of nuggets very attractive as a
means of incorporating uncertainty into the surrogate modeling process.
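
The effect of a nugget on interpolation can be illustrated with a minimal sketch. It assumes a
one-dimensional ordinary (constant-mean) Kriging model with a fixed Gaussian correlation length,
which is a simplification of the universal Kriging used in this work. The function names, the
noisy data, and the nugget magnitudes are hypothetical and were chosen only for illustration.

```python
import numpy as np

def gaussian_corr(a, b, theta=0.3):
    """Gaussian correlation between two sets of 1-D points (theta is assumed)."""
    d = a[:, None] - b[None, :]
    return np.exp(-(d / theta) ** 2)

def ordinary_kriging_predict(x_train, y_train, x_new, nugget):
    """Constant-mean Kriging prediction with a nugget added to the diagonal.

    `nugget` may be a scalar or a vector with one entry per training point.
    """
    n = len(x_train)
    C = gaussian_corr(x_train, x_train) + np.diag(np.broadcast_to(nugget, (n,)))
    ones = np.ones(n)
    # Generalized least-squares estimate of the constant mean
    beta = (ones @ np.linalg.solve(C, y_train)) / (ones @ np.linalg.solve(C, ones))
    # The nugget is added only to the training covariance matrix, not to c(x)
    c = gaussian_corr(x_new, x_train)
    return beta + c @ np.linalg.solve(C, y_train - beta)

x = np.linspace(0.0, 1.0, 6)
noise = np.array([0.05, -0.04, 0.03, -0.05, 0.04, -0.03])  # hypothetical noise
y = np.sin(2.0 * np.pi * x) + noise

# Near-zero nugget: the model reproduces the training data
exact = ordinary_kriging_predict(x, y, x, nugget=1e-12)
# Vector of larger nuggets: the model no longer interpolates
smooth = ordinary_kriging_predict(x, y, x, nugget=np.full(6, 0.1))

err_exact = np.max(np.abs(exact - y))
err_smooth = np.max(np.abs(smooth - y))
print(err_exact, err_smooth)
```

With the near-zero nugget the predictions at the training points match the data to within
round-off, while the vector of larger nuggets produces a looser fit of the same data.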

5.9 Review of Research Questions and Formulation of Hypotheses

As  a  result  of  the  literature  search  that  was  driven  by  the  research  questions,  multiple 
techniques  have  been  identified  that  may  improve  designers’  ability  to  apply  high-fidelity 
modeling  earlier  in  the  preliminary  design  phase.  The  major  factor  which  stymied  earlier 
attempts  was  the  significant  increase  in  modeling  expense  when  more  complex  data  sources 
are  used.  A  research  effort  was  undertaken  to  identify  alternative  techniques  and  methods 
which  would  help  to  reduce  this  expense  and  bring  such  modeling  into  the  realm  of  feasibility. 

The  first  focused  research  question  was: 

When “good performance” refers to responses within desirable ranges rather than maxima or
minima,  how  can  regions  of  good  performance  be  identified  and  emphasized  during  the 
sampling  process? 

In  light  of  the  described  adaptive  sampling  approaches,  the  contour-based  sampling  approaches 
described  by  Picheny  et  al.[149]  and  Ranjan  et  al.[159]  should  allow  the  preferential  selection 
of  cases  with  near-zero  moments.  This  is  the  basis  for  the  first  hypothesis: 

Hypothesis  1:  Contour-based  sampling  will  balance  the  selection  of  cases  with  good 
performance  and  the  reduction  of  prediction  uncertainty  in  promising  regions,  identifying 
samples  that  efficiently  improve  surrogate  accuracy  for  configurations  with  small  aerodynamic 
moments. 

The savings in computational effort from contour-based sampling alone are not expected to
reduce modeling costs sufficiently to enable large-scale application of expensive models. For
further reduction in cost, low-fidelity data will be used as a source of cheaper data, providing
estimates for the responses of interest that will be corrected by more accurate models.

The  next  focused  research  question  spurred  the  investigation  of  techniques  that  would  allow  the 
low-fidelity  estimates  to  be  blended  with  data  from  the  more  trusted  models: 

How  can  cheaper  analyses  be  integrated  with  high-fidelity  models  to  reduce  the  overall  cost  of 
design  space  exploration  or  exploitation? 

After  a  review  of  the  available  techniques,  a  number  of  possible  methods  were  identified  that 
might  improve  predictive  performance  and  reduce  dependence  on  expensive  data  sources. 
Multi-fidelity  modeling  is  abundant  in  the  literature,  but  it  remains  to  be  seen  which  of  the 
methods  identified,  if  any,  will  be  effective  for  the  problem  at  hand.  Rather  than  selecting  one 
now,  the  choice  will  be  deferred  until  comparisons  between  the  methods  can  be  made  for  one  or 
more  representative  problems.  Thus,  the  second  supporting  hypothesis  is  couched  in  more 
general terms:

Hypothesis 2: Data fusion techniques will allow results from high-fidelity analyses to be
augmented with cheaper sources of data to produce surrogate models that are more accurate yet
require less computationally expensive data.

The  last  focused  research  question  highlighted  uncertainty  in  the  data  and  sought  to  determine 
how  that  knowledge  might  be  incorporated: 

How  can  information  about  uncertainty  in  the  data  be  captured  effectively? 

It  was  noted  that  one  of  the  multi-fidelity  techniques  that  had  been  identified,  data 
harmonization,  serendipitously  also  included  an  explicit  mechanism  to  incorporate  uncertainty: 
the  Kriging  nugget.  If  Kriging  models  are  used  to  implement  the  other  data  fusion  techniques 
as  well,  nuggets  can  be  used  to  extend  each  of  those  methods  in  turn  so  that  each  one  can 
integrate  data  from  multiple  sources  and  capture  any  uncertainty  in  those  data  points.  This  leads 
to  the  next  hypothesis: 

Hypothesis  3:  When  creating  a  Kriging  model,  the  use  of  nuggets  will  capture  uncertainty  in 
the  data,  improving  predictive  accuracy  for  noisy  responses. 

Supporting  this  goal  are  the  various  techniques  which  may  be  used  to  quantify  the  uncertainty  in 
data  points.  These  techniques  include  interrogation  of  response  history  for  iterative  models,  the 
use  of  validation  experiments  to  assess  expected  model  accuracy,  and  targeted  validation 
experiments  to  quantify  the  likely  discrepancies  between  different  sources  of  data. 

Together,  the  selected  techniques  should  serve  to  produce  more  accurate  surrogate  models  while 
reducing  the  cost  necessary  to  generate  the  required  data,  addressing  the  primary  research 
question: 

How  can  high-fidelity  modeling  be  feasibly  applied  earlier  in  the  design  process,  despite  the 
computational  expense? 

The three previous hypotheses come together to address this, the overarching goal of the
research, with a final hypothesis:

Hypothesis  4:  By  placing  samples  intelligently,  reducing  dependence  on  the  expensive  models, 
and  accounting  for  any  uncertainty  in  the  data,  the  selected  methods  will  enable  improved 
surrogate  model  accuracy  with  significantly  reduced  data  requirements,  such  that  high-fidelity 
modeling  becomes  a  feasible  option  earlier  in  the  design  process. 

This  hypothesis  represents  a  proposed  procedure  for  sample  selection  and  surrogate  model 
creation. The first three hypotheses will provide guidance as to how certain steps of the process
might  best  be  carried  out.  The  final  hypothesis  asserts  that,  by  approaching  each  step  of  the 
method  with  appropriate  techniques,  the  result  -  surrogate  models  for  cases  of  interest  -  will 
have  quantifiably  better  predictive  accuracy  than  what  is  currently  possible  with  standard 
techniques.  In  essence,  the  final  hypothesis  corresponds  to  a  proposed  approach  to  the  problem. 

The approach being proposed is intended for problems for which standard sampling and
surrogate modeling techniques would result in excessive costs, particularly in the form of
execution time and/or computational effort. Specifically, the proposed approach will improve
efficiency by maximizing the information gained from each expensive analysis while
minimizing the number of such analyses that will be required.

5.9.1  Steps  in  the  Method 

The  method  being  proposed  will  assume  that  the  user  has  already  set  up  the  problem.  That 
process  entails  a  number  of  activities,  such  as  identifying  the  independent  and  dependent 
variables  that  will  be  included,  selecting  appropriate  data  sources,  and  validating  each  data 
source  against  “truth  data”  to  quantify  its  expected  accuracy.  This  description  will  also  assume 
that  two  data  sources  will  be  used,  one  being  more  accurate  but  having  much  higher  per- 
analysis  costs,  and  the  other  being  less  costly  but  lacking  the  necessary  accuracy  for  use  as  the 
only  data  source.  Depending  on  this  cheaper  data  source’s  speed  of  execution,  the  user  may  wish 
to  replace  it  with  a  surrogate  model.  Given  these  stipulations,  a  general  description  of  the 
procedure  for  creating  surrogate  models  is  as  follows: 

Step  1:  Generate  an  initial  set  of  samples  to  be  analyzed. 

Prior  knowledge  about  the  behavior  of  the  responses  to  be  modeled  may  indicate  that  certain 
regions  of  the  response  space  are  of  greater  interest  than  others.  The  initial  samples  may  then 
be  distributed  to  emphasize  those  regions.  Barring  such  prior  knowledge,  space-filling  sample 
distributions  such  as  Latin  hypercubes  or  Sobol  sequences  are  often  appropriate. 
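
A space-filling initial design of the kind mentioned in Step 1 can be sketched as follows. This is
a minimal Latin hypercube generator written with the Python standard library only; the function
name, the sample count, and the number of dimensions are hypothetical, and the inputs are assumed
to be normalized to the unit interval.

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    """Return n_samples points in [0, 1)^n_dims, one point per stratum per dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One value inside each stratum [k/n, (k+1)/n), then shuffle the strata order
        col = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(col)
        columns.append(col)
    # Transpose the per-dimension columns into a list of sample points
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_samples)]

samples = latin_hypercube(n_samples=10, n_dims=3)
for point in samples[:3]:
    print(point)
```

Each input dimension is divided into as many equal strata as there are samples, and every stratum
is occupied exactly once, which spreads the samples across the range of each input.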

Step  2:  Analyze  the  samples  using  the  appropriate  data  sources. 

Step  3:  Train  Kriging  surrogate  models  using  the  resulting  data. 

It  should  be  noted  that  the  effectiveness  of  a  data  fusion  technique  will  depend  on  the  problem 
being  addressed.  Some  of  the  experimental  effort  in  this  research  will  be  dedicated  to 
identifying  which  of  the  selected  data  fusion  techniques  is  best  suited  for  the  intended 
application  of  modeling  reusable  booster  aerodynamics. 

Step  4:  Evaluate  the  resulting  surrogate  models  to  quantify  the  predictive  accuracy. 

There  are  two  primary  ways  to  quantify  the  predictive  accuracy  of  a  surrogate  model.  The  first 
method  is  to  set  aside  a  number  of  cases.  [88]  These  cases  are  not  used  to  train  surrogate 
models.  Instead,  the  surrogate  models  are  used  to  predict  the  response  values  for  those  cases, 
and  then  the  predicted  values  are  compared  against  the  observed  values.  The  discrepancy 
between  the  predicted  and  actual  values  is  then  used  to  assess  the  prediction  error. 

The  other  alternative  is  to  use  cross  validation.  [148]  When  performing  cross  validation,  all  of 
the  available  data  is  used  to  create  the  main  surrogate  models.  The  training  data  is  then  split  up 
into  subsets  and  a  number  of  temporary  surrogates  are  trained,  each  of  which  is  trained  using  all 
but  one  of  the  subsets.  Each  temporary  surrogate  is  then  used  to  predict  the  response  values  for 
the  subset  of  data  that  was  not  used  in  its  training.  Again,  the  discrepancy  between  the 
predicted  and  actual  values  is  used  to  assess  prediction  error.  Depending  on  the  amount  of  data 
available  and  the  level  of  effort  required  to  train  a  surrogate  model,  the  subsets  may  be  as  large 
as 20 percent of the data set or as small as one data point.[58, 75, 100] This approach has the
benefit of using all available data when training the final surrogate models, but may
underestimate the predictive accuracy of those surrogates.
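
The cross validation procedure described above can be sketched as follows. The straight-line
surrogate is a hypothetical stand-in for a Kriging model, the data are notional, and the
interleaved assignment of points to subsets is an assumption made for this illustration.

```python
import math

def k_fold_rmse(xs, ys, fit, k):
    """Root-mean-square prediction error estimated by k-fold cross validation."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved subsets
    squared_error = 0.0
    for held_out in folds:
        held = set(held_out)
        train_x = [xs[i] for i in range(n) if i not in held]
        train_y = [ys[i] for i in range(n) if i not in held]
        predict = fit(train_x, train_y)                # temporary surrogate
        for i in held_out:
            squared_error += (predict(xs[i]) - ys[i]) ** 2
    return math.sqrt(squared_error / n)

def fit_line(xs, ys):
    """Hypothetical stand-in surrogate: least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)

xs = [i / 9 for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]                        # notional linear response

rmse_5fold = k_fold_rmse(xs, ys, fit_line, k=5)         # 20 percent held out per fold
rmse_loo = k_fold_rmse(xs, ys, fit_line, k=len(xs))     # leave-one-out
print(rmse_5fold, rmse_loo)
```

Setting `k` equal to the number of data points gives the leave-one-out case, while `k = 5`
corresponds to subsets containing 20 percent of the data.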


Figure  8:  Baseline  Process  for  Sample  Selection  and  Surrogate  Model  Creation 

Steps 5a & 5b: If the surrogate models are sufficiently accurate or the project resources
have been consumed, terminate the process.

Step 6: Otherwise, select new samples for analysis and go to Step 2.


The  process  of  selecting  new  samples  can  be  thought  of  as  an  optimization:  the  samples  that 
are  selected  will  be  those  that  are  the  most  useful.  The  exact  definition  of  “most  useful,”  and 
thus  the  behavior  of  the  sample-selection  algorithm,  will  depend  on  the  choice  of  objective 
function.  Hypothesis  1  asserts  that  contour-based  sampling  will  be  effective  for  the  problem 
at  hand;  this  hypothesis  must  be  tested  before  being  accepted. 
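
Steps 1 through 6 can be sketched as a simple loop. Every component below is a hypothetical
placeholder: the analysis is a one-dimensional test function, a piecewise-linear fit stands in
for the Kriging surrogate, accuracy is assessed on a set of cases that were set aside, and new
samples are chosen with a space-filling rule rather than the contour-based criterion.

```python
import bisect

def fit_piecewise_linear(xs, ys):
    """Step 3 stand-in: a piecewise-linear surrogate (not a Kriging model)."""
    pts = sorted(zip(xs, ys))
    px = [p[0] for p in pts]
    py = [p[1] for p in pts]

    def predict(x):
        # Index of the right end of the interval that brackets x
        j = min(max(bisect.bisect_right(px, x), 1), len(px) - 1)
        t = (x - px[j - 1]) / (px[j] - px[j - 1])
        return py[j - 1] + t * (py[j] - py[j - 1])

    return predict

def widest_gap_midpoint(xs):
    """Step 6 stand-in: a space-filling rule that bisects the widest gap."""
    s = sorted(xs)
    gap, left = max((b - a, a) for a, b in zip(s, s[1:]))
    return left + gap / 2.0

def build_surrogate(analyze, initial_xs, holdout_xs, tolerance, budget):
    xs = list(initial_xs)                                  # Step 1
    ys = [analyze(x) for x in xs]                          # Step 2
    holdout_ys = [analyze(x) for x in holdout_xs]          # cases set aside for Step 4
    while True:
        model = fit_piecewise_linear(xs, ys)               # Step 3
        error = max(abs(model(x) - y)                      # Step 4
                    for x, y in zip(holdout_xs, holdout_ys))
        if error <= tolerance or len(xs) >= budget:        # Steps 5a and 5b
            return model, xs, error
        x_new = widest_gap_midpoint(xs)                    # Step 6
        xs.append(x_new)
        ys.append(analyze(x_new))                          # Step 2 for the new sample

model, samples, error = build_surrogate(
    analyze=lambda x: x * x,
    initial_xs=[0.0, 0.5, 1.0],
    holdout_xs=[i / 200 for i in range(201)],
    tolerance=1e-3,
    budget=30,
)
print(len(samples), error)
```

Replacing `widest_gap_midpoint` with a different objective changes which samples are considered
"most useful" without altering the structure of the loop.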

For  the  purposes  of  this  work,  it  will  be  assumed  that  the  cost  of  obtaining  the  low-fidelity 
response  at  any  point  is  negligible.  The  sample  selection  process  will  therefore  only  be 
concerned  with  the  selection  of  high-fidelity  samples.  If  the  per-analysis  cost  of  the  low- 
fidelity  data  source  is  not  negligible,  it  is  recommended  that  the  low-fidelity  data  source  be 
replaced with a separate surrogate model, which would allow the low-fidelity response to be
predicted without the associated analysis costs. If that is impossible or too labor-intensive, the
reader is referred to the work of Huang et al.,[83] who describe an adaptive sampling approach
that  also  selects  the  best  data  source  to  use  for  each  new  sample  based  on  the  relative  costs  and 
benefits  of  each  data  source. 

With respect to the overall focus of this research - i.e., the creation of surrogate models and
the selection of subsequent samples to improve surrogate accuracy - the generic process is laid
out  graphically  in  Figure  8.  The  first  three  hypotheses  of  this  research  effort  pertain  to  how 
specific  steps  in  the  process  are  performed.  Hypothesis  1  addresses  Step  6,  asserting  that  a 
particular  adaptive  sampling  approach  will  yield  better  results  than  spending  the  full  set  of 
resources  on  space-filling  samples.  Hypotheses  2  and  3  address  Step  4,  emphasizing  how  the 
data  (including  validation  results  for  each  data  source)  will  be  used  to  create  surrogates.  The 
final and most important hypothesis, Hypothesis 4, asserts that when the process is carried out
using  the  recommended  techniques,  the  performance  of  the  resulting  surrogate  models  will  be 
better  than  the  surrogates  produced  by  the  baseline  approach  of  single-fidelity  modeling  and 
space-filling  samples. 

Although  each  hypothesis  was  constructed  based  on  known  research  and  evidence,  further 
experiments  must  be  performed  to  determine  whether  the  evidence  supports  or  undermines  these 
hypotheses.  Experimental  results  which  support  the  first  three  hypotheses,  due  to  the  hierarchical 
style  of  construction,  also  support  the  fourth  hypothesis.  The  experimental  plan  was  designed  to 
test each hypothesis in turn.

6 Evaluating Contour-Based Sampling

The expense of each Cart3D analysis is not negligible, so efforts were made to maximize the
reuse of results. To this end, experiments were designed so that the cases for one experiment
would  be  useful  for  another.  This  is  particularly  true  of  the  space-filling  sample  designs  that 
exemplify  the  baseline  approach.  Points  selected  by  adaptive  sampling  were  expected  to  vary 
from  experiment  to  experiment,  with  little  chance  of  commonality.  Examples  of  this  will  be 
highlighted  in  the  description  of  the  experiments  as  appropriate. 

The  first  hypothesis  to  be  evaluated  asserts  that  contour-based  sampling  would  allow  accurate 
surrogate  models  to  be  trained  using  fewer  samples.  The  null  hypothesis  in  this  case  took  the 
form  of  a  space-filling  sampling  design.  Using  the  results  from  the  RBS  study,  three  flight 
conditions  were  selected  as  having  unusually  poor  fits  for  pitching  moment  relative  to  the  other 
models. These conditions were: Mach 0.3, α = 15°, β = 0°; Mach 0.8, α = 0°, β = 0°; and Mach 2.5,
α = 0°, β = 0°. This trio of conditions posed the most difficulty for the space-filling sampling
approach,  and  thus  these  models  were  most  in  need  of  improvement. 

The  overall  objective  is  the  generation  of  surrogate  models  with  good  predictive  accuracy  for 
vehicles  with  desirable  performance,  i.e.,  small  aerodynamic  moments.  If  only  one  flight 
condition  were  being  considered,  the  problem  would  be  simple,  as  only  three  responses  must 
be  addressed.  For  a  more  general  problem,  it  is  possible  that  each  of  the  three  responses  might 
exhibit  very  different  behavior:  a  vehicle  with  near-zero  pitching  moment  at  a  small  angle  of 
attack  might  exhibit  a  large  pitching  moment  at  larger  angles  of  attack.  However,  the  problem  at 
hand  only  included  cases  with  no  sideslip,  and  so  the  two  lateral  responses  were  expected  to  be 
close  to  zero.  It  was  therefore  expected  that  a  concept’s  ability  to  trim  for  these  conditions  would 
be  dominated  by  its  pitching  moment,  and  so  the  primary  focus  of  the  experiment  was  to 
demonstrate accurate and efficient modeling of vehicle pitching moment.

6.1 Conceptual Description of Sampling Algorithm

The  sampling  algorithm  was  derived  from  the  one  proposed  by  Picheny  et  al.[149]  in  2010. 
That  algorithm  was  intended  for  problems  with  one  response.  As  part  of  this  research  effort,  the 
method  was  expanded  to  handle  multiple  responses.  The  expanded  algorithm  is  as  follows: 

1. Create Kriging models of the current data points and the response values at those points.

2. Generate a set of candidate points.

3. Generate a set of test points.

4. Use the current Kriging model to estimate the response values and the prediction
variances for the candidate and test points.

5. For each candidate point, calculate how adding that point to the training data set would
change the Kriging prediction variance at each test point. Combine the resulting test
point prediction variances in a weighted sum. The value of the weighting function
will vary based on the likelihood that the test point has a response near the threshold of
interest.

6. Repeat step 5 for each response.

7. Normalize the weighted sums for each response. The candidate point’s “score” is the
average of its normalized weighted sum results.

8. Select the candidate with the lowest score as the next case to be sampled.

Steps 1-8 are repeated until the Kriging model is sufficiently accurate or all resources have
been consumed. The specifics of this algorithm rely on the mathematical details of the
Kriging technique, which shall be reviewed briefly. More detailed information may be found in
Journel & Huijbregts,[93] Sacks et al.,[170] or Lophaven et al.[108]
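
Steps 7 and 8 of the algorithm above can be sketched as follows. The text does not specify how
the weighted sums are normalized, so min-max scaling within each response is an assumption of
this sketch; the function name and the numerical values are likewise hypothetical.

```python
def select_candidate(weighted_sums):
    """Pick the candidate with the lowest average normalized weighted sum.

    weighted_sums[r][c] is the weighted sum of test-point prediction variances
    for response r if candidate c were added to the training data.
    """
    n_responses = len(weighted_sums)
    n_candidates = len(weighted_sums[0])
    scores = [0.0] * n_candidates
    for row in weighted_sums:
        lo, hi = min(row), max(row)
        span = hi - lo
        for c, value in enumerate(row):
            # Min-max scaling to [0, 1] within one response (assumed normalization)
            normalized = (value - lo) / span if span > 0 else 0.0
            scores[c] += normalized / n_responses
    best = min(range(n_candidates), key=scores.__getitem__)
    return best, scores

# Notional weighted sums for two responses and three candidates
sums = [
    [4.0, 2.0, 3.0],       # response 1
    [0.10, 0.30, 0.12],    # response 2
]
best, scores = select_candidate(sums)
print(best, scores)
```

In this notional case the third candidate is selected: it is not the best choice for either
response alone, but it offers the best average improvement over both.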

6.2 Review of Kriging Mathematics

This work was based on the form of Kriging known as “universal Kriging,” which models the
response as a combination of a set of basis functions and a zero-mean Gaussian process. The basis
functions describe the overall behavior of the response in a manner similar to response surface
equations,[134] while the zero-mean Gaussian process captures any departures from the
large-scale behavior captured by the basis functions. The best linear unbiased predictor for the
response $y(x)$ at some unsampled point $x$, given a set of $n$ other observations $Y$ defined by $p$
input parameters, can be calculated as:

$$m_K(x) = f(x)^T \hat{\beta} + c(x)^T C^{-1} \left( Y - F \hat{\beta} \right) \tag{1}$$

Here, $f(x)$ is a $(p+1) \times 1$ vector of basis functions, $\hat{\beta}$ is a $(p+1) \times 1$ vector of estimated
coefficients, $c(x)$ is an $n \times 1$ vector of covariances, $C$ is an $n \times n$ covariance matrix, and $F$ is
an $n \times (p+1)$ experimental matrix. The vector of estimated coefficients is calculated by a
generalized least-squares estimate:

$$\hat{\beta} = \left( F^T C^{-1} F \right)^{-1} F^T C^{-1} Y \tag{2}$$

The  second  set  of  terms  to  the  right  of  the  equality  in  Equation  1  is  a  function  which  estimates 
how  much  deviation  from  the  underlying  behavior  can  be  expected  at  a  particular  point.  This 
deviation  is  calculated  based  on  the  estimated  correlation  between  the  point  in  question  and 
nearby  “known”  data  points.  A  number  of  different  correlation  functions  may  be  used  as  part 
of  Kriging  models,  with  the  Gaussian  function  seemingly  the  most  common  selection.  [107,  170] 
The Gaussian correlation function is given as:

$$k(u, v) = \prod_{i=1}^{n} \exp\left[ -\left( \frac{|u_i - v_i|}{\theta_i} \right)^2 \right] \tag{3}$$

Here, $i$ refers to the dimension of the vectors $u$ and $v$. If distance in each dimension is
weighted equally, every $\theta_i$ has the same value and this becomes the isotropic Gaussian
correlation function. The more general case of an anisotropic function, where each dimension is
weighted independently, will be assumed for the following work.

Using the correlation function, the Kriging prediction variance at any point $x$ may be calculated
as:

$$s_K^2(x) = k(x, x) - c(x)^T C^{-1} c(x) + \left( f(x)^T - c(x)^T C^{-1} F \right) \left( F^T C^{-1} F \right)^{-1} \left( f(x)^T - c(x)^T C^{-1} F \right)^T \tag{4}$$


In this equation, $k(x, x)$ is the process variance; $c(x)$ is a vector of covariances between the
point $x$ and the points used to build the model; $C^{-1}$ is the inverse of the matrix of covariances
between points used to build the model; $f(x)$ is a vector of basis functions describing the point
$x$; and $F$ is the experimental matrix of basis functions which describe the points used to build
the model. The covariance function is equal to the correlation function multiplied by the
process variance. Basis functions refer to the variables chosen to describe the relevant model
factors. For example, for a typical two-dimensional linear model, the basis functions would
be $1$, $x_1$, and $x_2$ to capture a constant mean and the linear effect due to each input dimension.


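The process variance is a measure of how well the underlying trend, $f(x)^T \hat{\beta}$, accounts for
the observed data. When the trend is a good representation of the response behavior, the
process variance will be small because the observed responses are close to the expected trend.

Once the trend coefficients $\hat{\beta}$ have been calculated, the process variance can be estimated using
the observed data points:[117]

$$\hat{\sigma}^2 = \frac{1}{n} \left( Y - F \hat{\beta} \right)^T R^{-1} \left( Y - F \hat{\beta} \right) \tag{5}$$

Here, $R$ is the $n \times n$ correlation matrix of the points used to build the model, so that
$C = \hat{\sigma}^2 R$.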

Note that the second set of terms to the right of the equality in Equation 4 captures the degree
to which any nearby “known” data points decrease the prediction variance at the desired point
$x$. This term will be maximized (and prediction variance minimized) when the covariance
vector $c(x)$ is maximized. Looking back at the Gaussian correlation function in Equation 3, $c(x)$
will be large when the point $x$ is very close to the known points used to build the model.
Essentially, Kriging prediction variance will be smallest when $x$ is close to the points used to
build the model and larger for $x$ farther away. This feature of Kriging forms the foundation of the
contour-based sampling algorithm.
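
The behavior described above can be checked numerically with a minimal sketch of Equations 1, 2,
4, and 5. It assumes linear basis functions, fixed (not optimized) correlation parameters, and a
Gaussian correlation of the form exp(-((u_i - v_i) / theta_i)^2) in each dimension; the function
names, the training grid, the theta values, and the notional response are all hypothetical.

```python
import numpy as np

def corr(A, B, theta):
    """Anisotropic Gaussian correlation between the rows of A and the rows of B."""
    d = (A[:, None, :] - B[None, :, :]) / theta
    return np.exp(-np.sum(d ** 2, axis=2))

def basis(X):
    """Linear basis functions [1, x_1, ..., x_p] for each row of X."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit(X, Y, theta):
    R = corr(X, X, theta)
    F = basis(X)
    # Equation 2 (the process variance cancels, so R can be used in place of C)
    beta = np.linalg.solve(F.T @ np.linalg.solve(R, F), F.T @ np.linalg.solve(R, Y))
    resid = Y - F @ beta
    sigma2 = resid @ np.linalg.solve(R, resid) / len(Y)    # Equation 5
    return dict(X=X, theta=theta, R=R, F=F, beta=beta, resid=resid, sigma2=sigma2)

def predict(model, X_new):
    s2 = model["sigma2"]
    F = model["F"]
    C = s2 * model["R"]                                    # covariance = variance * correlation
    c = s2 * corr(X_new, model["X"], model["theta"])       # each row is c(x)^T
    f = basis(X_new)                                       # each row is f(x)^T
    mean = f @ model["beta"] + c @ np.linalg.solve(C, model["resid"])   # Equation 1
    u = f - c @ np.linalg.solve(C, F)                      # f(x)^T - c(x)^T C^-1 F
    G = F.T @ np.linalg.solve(C, F)                        # F^T C^-1 F
    var = (s2
           - np.sum(c * np.linalg.solve(C, c.T).T, axis=1)
           + np.sum(u * np.linalg.solve(G, u.T).T, axis=1))             # Equation 4
    return mean, var

g = np.array([0.0, 0.5, 1.0])
X = np.array([[a, b] for a in g for b in g])               # 3 x 3 grid of training points
Y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2                   # notional response
model = fit(X, Y, theta=np.array([0.4, 0.4]))

mean_train, var_train = predict(model, X)
mean_far, var_far = predict(model, np.array([[0.25, 0.25]]))
print(np.max(np.abs(var_train)), var_far[0])
```

At the training points the predictor reproduces the data and the prediction variance is zero to
within round-off, while the variance at a point between the training points is noticeably larger.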

6.3  Mathematical  Formulation  of  Sampling  Algorithm 

The sampling algorithm seeks to identify the point that will most reduce the prediction variance
of the Kriging model in regions where the response is near some value of interest. For the
planned application of this method, the value of interest is a pitching moment coefficient of zero
with a range of interest of ±0.1. Configurations outside this range are not expected to be
able to trim, and are thus likely to be infeasible from a vehicle control standpoint.

The  algorithm  proceeds  in  the  following  manner:  first,  a  Kriging  model  is  fit  to  each  response 
of  interest.  There  will  be  at  least  one  response  of  interest  (pitching  moment  coefficient)  for 
every  flight  condition  that  will  be  evaluated.  The  second  and  third  steps  are  to  generate  a  set  of 
candidate  points  and  a  set  of  test  points.  These  points  may  be  generated  using  a  space-filling 
method such as a Latin hypercube,[126] or may be distributed according to the preferences of
the user. The space-filling distribution, being the more general approach, was assumed
throughout this work. One of the candidate points must be chosen as the next point to be
sampled. As the number of candidate points increases, the algorithm will have more options to
choose from; as the number of test points increases, the algorithm can more accurately assess
each candidate. Enlarging either set comes at a cost of increased analysis time.

The  fifth  step  is  to  calculate  the  change  in  prediction  variance  due  to  sampling  a  particular 
candidate  point  using  a  technique  known  as  weighted  integrated  mean  squared  error,  or 
wIMSE.[97,  149]  This  technique  leverages  the  ability  of  a  Kriging  model  to  estimate  the 
expected  response  value  and  prediction  variance  at  each  test  point.  Once  the  fifth  step  has  been 
repeated  for  each  response  of  interest,  the  final  step  is  to  normalize  the  scores  of  all  candidates 
for  every  response  and  select  the  candidate  which  promises  the  best  average  improvement  over 
all  responses.  This  fifth  step  is  clearly  the  heart  of  the  algorithm,  and  must  be  explained  in 
more  detail. 


6.3.1  Quantifying  How  Candidate  Points  Affect  Prediction  Variance 

First, the $n$ data points making up the original model are temporarily augmented with one of
the candidate points. The $n \times n$ symmetric covariance matrix $C$ is thus expanded to size
$(n+1) \times (n+1)$ to capture the covariance of this candidate point with each of the $n$ existing
samples.


Recall from Equation 4 that the Kriging prediction variance depends on the inverse of the
covariance matrix. Inverting the full covariance matrix once for each candidate point is possible
but inefficient: because only the $(n+1)$th row and column of $C$ are changing, most of the matrix
to be inverted will remain constant no matter which candidate is considered:

Picheny et al. recommend that the inverse of the enlarged matrix be calculated using Schur’s
complement formula,[201] resulting in the following equation:

Because $C_n^{-1}$ can be computed once and re-used for every candidate evaluation, this approach
replaces a matrix inversion operation with a series of matrix multiplication operations,
reducing the computational cost significantly. To quantify the cost reduction, notional data was
generated. This data had forty-nine input dimensions, similar to the problem of interest. Using
progressively larger sets of notional data, the contour-based sampling algorithm was used to
evaluate 10 candidates using 300 test points. For a large parameter space such as this, many
more test points would be required to accurately evaluate a given candidate; given that this test
was purely to assess numerical speed, it was sufficient that the number of candidate and test
points be consistent.

The first algorithm used Schur’s complement, as recommended in Picheny et al., while the
second  algorithm  explicitly  fit  a  new  Kriging  model  every  time  a  new  candidate  was 
evaluated.  The  time  to  evaluate  all  candidates  and  select  the  best  option  was  recorded.  This 
was repeated 10 times for each data set. The average time required for the two methods to
select a sample is given in Table 2. For models of the size and complexity expected in this
effort,  the  use  of  Schur’s  complement  was  found  to  reduce  the  sample  selection  time  by 
approximately  90  percent,  or  a  full  order  of  magnitude. 


Table 2: Average Sample Selection Speed With & Without Schur’s Complement (seconds)

                            200 Cases   400 Cases   600 Cases   800 Cases   1,000 Cases
Using Schur’s Complement           22          66         133         229           342
Direct Matrix Inversion           196         674       1,418       2,405         3,592
Ratio                            9.03        10.3        10.7        10.5          10.5


Using this approach, the inverse covariance matrix $C_{n+1}^{-1}$ can be obtained just as if the Kriging
model had been re-trained to include the candidate point. The covariance vector $c_{n+1}(x)$ for a
test point $x$ captures the covariance of $x$ with each of the samples as well as with the current
candidate point.
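
The block update can be sketched as follows, using the standard Schur complement identity for a
symmetric matrix that is bordered by one additional row and column. The function names, the
one-dimensional training points, the candidate locations, and the correlation parameter are
hypothetical; the sketch only checks that the update agrees with direct inversion.

```python
import numpy as np

def expand_inverse(C_inv, c_new, k_new):
    """Inverse of [[C, c_new], [c_new^T, k_new]] given C_inv, the inverse of C."""
    v = C_inv @ c_new                     # C^-1 c_new
    schur = k_new - c_new @ v             # scalar Schur complement of C
    n = C_inv.shape[0]
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = C_inv + np.outer(v, v) / schur
    out[:n, n] = -v / schur
    out[n, :n] = -v / schur
    out[n, n] = 1.0 / schur
    return out

def gaussian_cov(a, b, theta=0.3, variance=1.0):
    """Gaussian covariance between two sets of 1-D points (parameters assumed)."""
    return variance * np.exp(-((a[:, None] - b[None, :]) / theta) ** 2)

x_train = np.linspace(0.0, 1.0, 5)
C_inv = np.linalg.inv(gaussian_cov(x_train, x_train))     # computed once

max_diff = 0.0
for x_cand in (0.1, 0.6):                                  # notional candidate points
    xc = np.array([x_cand])
    c_new = gaussian_cov(x_train, xc)[:, 0]                # covariance with existing samples
    k_new = gaussian_cov(xc, xc)[0, 0]                     # variance at the candidate
    fast = expand_inverse(C_inv, c_new, k_new)
    x_all = np.append(x_train, x_cand)
    direct = np.linalg.inv(gaussian_cov(x_all, x_all))     # re-inverting from scratch
    max_diff = max(max_diff, float(np.max(np.abs(fast - direct))))
print(max_diff)
```

Only matrix-vector products and one outer product are needed per candidate, which is why the
cost per candidate grows more slowly than that of a full inversion.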

The final step in assessing the effect of a candidate on the surrogate is to append the basis
function representation of the candidate point to the experimental matrix $F$. Selection of
basis functions is left up to the user. This effort used a linear model, with each input parameter
forming one basis function and an extra column to account for the mean value of the response.
For a problem with normalized input parameters $x_1$ and $x_2$, the linear basis function values for
sample $k$ would be $[1 \;\; x_1(k) \;\; x_2(k)]$.

The new $C_{n+1}^{-1}$ and expanded $F$ matrices are then plugged into Equation 4, which in turn allows the
calculation  of  prediction  variance  as  if  the  current  candidate  point  had  been  sampled  and  added 
to  the  model.  This  updated  variance  equation  is  then  used  to  calculate  the  prediction  variance  at 
each  test  point. 

Before  these  variances  are  combined,  however,  a  weighting  factor  is  applied.  This  weighting 
factor  is  large  for  test  points  with  responses  close  to  the  target  value  and  small  for  test  points 
with  responses  far  from  the  target  value.  Thus  -  and  this  is  the  heart  of  the  algorithm  -  the 
candidate  with  the  smallest  weighted  sum  of  variances  (i.e.,  prediction  uncertainty)  can  be 
considered  the  one  that  most  reduces  prediction  variance  for  cases  with  responses  close  to  the 
target  value. 

6.3.2  Weighting  Function  Calculations 

Picheny et al. suggest two alternatives for the weighting function, one using an indicator function and one using a Gaussian density function. Because the problem at hand features a response region of interest with clearly-defined bounds (e.g., 0 ± 0.1), the indicator function was selected. This weighting function is calculated via:

$$W(x) = \Phi\left(\frac{T + \varepsilon - m_n(x)}{s_n(x)}\right) - \Phi\left(\frac{T - \varepsilon - m_n(x)}{s_n(x)}\right) \qquad (8)$$

Here, $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution; $T$ is the target response value (0); $\varepsilon$ is the half-width of the range of interest (0.1); $m_n(x)$ is the predicted response value at point x; and $s_n(x)$ is the prediction standard deviation (the square root of the prediction variance) at point x. This prediction variance is calculated using only the n known samples; candidate points are not included. In effect, this weighting function is equal to the probability that the response at point x has a value between $T - \varepsilon$ and $T + \varepsilon$.
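
The interpretation of the weight as a probability can be sketched in a few lines. This is a minimal illustration that assumes a Gaussian prediction with the given mean and standard deviation; it uses the error function from the Python standard library for the CDF rather than the polynomial approximation discussed in this section, and the names `standard_normal_cdf` and `weight` are hypothetical.

```python
import math


def standard_normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def weight(mean, std, target=0.0, half_width=0.1):
    """Probability that a response distributed as N(mean, std^2) lies
    within [target - half_width, target + half_width].

    std must be strictly positive.
    """
    upper = standard_normal_cdf((target + half_width - mean) / std)
    lower = standard_normal_cdf((target - half_width - mean) / std)
    return upper - lower
```

For example, `weight(0.0, 0.01)` is very close to 1 (a confident prediction at the target), while `weight(1.0, 0.01)` is essentially 0 (a confident prediction far from the target).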

Because Kriging prediction variance is assumed to have a Gaussian distribution, the CDF may be calculated analytically. Zelen & Severo provide a method for calculating the CDF for Z > 0:[1]

$$\Phi(Z) = 1 - \phi(Z)\left(b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5\right) + \epsilon(Z), \qquad t = \frac{1}{1 + b_0 Z} \qquad (9)$$

Here, $\phi(Z)$ is the probability density function (PDF) of the standard normal distribution, and $b_0 = 0.2316419$, $b_1 = 0.319381530$, $b_2 = -0.356563782$, $b_3 = 1.781477937$, $b_4 = -1.821255978$, and $b_5 = 1.330274429$. $\epsilon(Z)$ is the discrepancy between this approximation and the true CDF value, with $|\epsilon(Z)| < 7.5 \times 10^{-8}$. The PDF of a normal distribution is given as:

$$\phi(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right] \qquad (10)$$


The standard normal is a special case of the Gaussian, or normal, distribution for which $\mu$, the mean, is equal to zero and $\sigma^2$, the variance, is equal to one. This is commonly written as N(0,1). Any normal distribution $N(\mu, \sigma^2)$ can be transformed into a standard normal distribution using the following equation:

$$Z = \frac{y - \mu}{\sigma} \qquad (11)$$


Using this transformation, Equation 10 becomes:

$$\phi(Z) = \frac{1}{\sqrt{2\pi (1)^2}} \exp\left[-\frac{(Z - 0)^2}{2(1)^2}\right] \qquad (12)$$


In order to calculate the probability that the response at point x is less than the upper bound of the response target region (0.1 for pitching moment), y in Equation 11 would be replaced with $T + \varepsilon$. Likewise, $m_n(x)$ would replace $\mu$ and $s_n(x)$ would replace $\sigma$. The resulting Z is plugged into Equations 9 & 12 to calculate the CDF of the normal distribution. The CDF value which results is the probability on a zero-to-one scale that the true response at point x is less than $T + \varepsilon$. If the calculation is repeated with $T - \varepsilon$ instead of $T + \varepsilon$, the results can be substituted into Equation 8 to obtain the weighting function value for the test point x.^1
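
The polynomial CDF approximation of Zelen & Severo can be sketched as follows, using the coefficients quoted in the text. The handling of negative arguments relies on the symmetry of the standard normal distribution, Φ(Z) = 1 − Φ(−Z); the function names are hypothetical and this is an illustration rather than the code used in this effort.

```python
import math

# Coefficients of the Zelen & Severo approximation, as quoted in the text
B0 = 0.2316419
B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def standard_normal_pdf(z):
    """PDF of the standard normal distribution (zero mean, unit variance)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def standard_normal_cdf_approx(z):
    """Approximate the standard normal CDF with a fifth-order polynomial in t.

    The polynomial is only applied for z >= 0; negative arguments are
    reflected using the identity Phi(z) = 1 - Phi(-z).
    """
    if z < 0.0:
        return 1.0 - standard_normal_cdf_approx(-z)
    t = 1.0 / (1.0 + B0 * z)
    poly = sum(b * t ** (k + 1) for k, b in enumerate(B))
    return 1.0 - standard_normal_pdf(z) * poly
```

To evaluate the weight for a test point, the function would be called once with Z computed from the upper bound of the range of interest and once with Z computed from the lower bound, and the two results subtracted.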

6.3.3  Application  of  Mathematical  Framework 

The  necessary  mathematical  tools  to  evaluate  a  candidate  point  have  now  been  collected. 
Equation  2  captures  how  the  candidate  will  affect  the  inverse  covariance  matrix  and  thus  the 
prediction  variance  (via  Equation  4)  at  each  test  point.  This  prediction  variance  is  weighted 
by  Equation  8,  which  emphasizes  test  cases  where  the  response  value  is  expected  to  be  close  to 
the value of interest. The weighted prediction variances are then summed to produce a
weighted  Integrated  Mean  Squared  Error  (wIMSE)  score.  This  wIMSE  score  quantifies  the 
amount  of  prediction  variance,  or  uncertainty,  which  would  be  present  if  the  candidate  point 
were  added  to  the  model. 

If  the  problem  only  features  one  response,  the  algorithm  simplifies  to  the  form  proposed  by 
Picheny et al., and the candidate with the lowest wIMSE score would be selected as the next
sample  because  that  candidate  is  expected  to  produce  the  largest  reduction  in  prediction 
variance  for  that  response. 
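
For the single-response case, the scoring and selection step reduces to a weighted sum and a minimum search. The sketch below assumes that the weights and the updated prediction variances at the test points have already been computed; the function and argument names are hypothetical.

```python
def wimse(weights, updated_variances):
    """Weighted sum of prediction variances over the test points.

    weights           : weighting function value at each test point
    updated_variances : prediction variance at each test point, computed
                        as if the candidate had been added to the model
    """
    return sum(w * v for w, v in zip(weights, updated_variances))


def select_candidate(weights, variances_by_candidate):
    """Return the index of the candidate with the smallest wIMSE score,
    together with the scores of all candidates."""
    scores = [wimse(weights, v) for v in variances_by_candidate]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

The weights are shared by all candidates because they depend only on the current model, whereas the variances must be re-evaluated for every candidate.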

If  the  problem  features  multiple  responses,  further  steps  are  required  before  the  best  candidate 
can  be  identified.  First,  because  the  prediction  variance  of  each  response  may  differ  by 
multiple  orders  of  magnitude,  the  wIMSE  scores  for  each  response  must  be  normalized.  This  is 
accomplished using the mean and standard deviation of the wIMSE scores for that response, $\mu_{wIMSE}$ and $\sigma_{wIMSE}$.

The normalized wIMSE score for candidate i is thus:

$$wIMSE_{i,norm} = \frac{wIMSE_i - \mu_{wIMSE}}{\sigma_{wIMSE}} \qquad (13)$$


Once  wIMSE  scores  have  been  normalized  for  all  responses,  the  average  wIMSE  score  for 
each  candidate  is  calculated  based  on  its  wIMSE  score  for  each  response.  The  candidate  with  the 
smallest  average  wIMSE  score  is  chosen  as  the  next  point  to  be  sampled. 
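
The multi-response selection step can be sketched as below. The text does not state whether the population or sample standard deviation was used; the population form (`statistics.pstdev`) is assumed here, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def normalize(scores):
    """Standardize one response's wIMSE scores to zero mean, unit deviation.

    Assumes the scores are not all identical (nonzero standard deviation).
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    return [(s - mu) / sigma for s in scores]


def select_multi_response(scores_by_response):
    """Pick the candidate with the smallest average normalized wIMSE.

    scores_by_response : one list of wIMSE scores per response, each
                         ordered by candidate index
    """
    normalized = [normalize(scores) for scores in scores_by_response]
    n_candidates = len(scores_by_response[0])
    averages = [
        mean([response[i] for response in normalized])
        for i in range(n_candidates)
    ]
    best = min(range(n_candidates), key=averages.__getitem__)
    return best, averages
```

Normalizing before averaging prevents a response whose variances are orders of magnitude larger from dominating the choice of the next sample.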

6.4  Use  of  Alternative  Surrogate  Modeling  Methods 

It  should  be  noted  that  Kriging  is  not  necessarily  required  for  this  algorithm.  Strictly 
speaking,  the  algorithm  requires  only  (a)  a  method  of  predicting  the  response  value  at  some 
new  point,  and  (b)  a  way  of  estimating  the  prediction  variance  or  uncertainty  at  that  point.  Any 
surrogate modeling technique should allow the prediction of response values, and cross-validation offers a way to estimate prediction uncertainty at any given point. [148] However, cross-validation requires the generation of many additional surrogate models, which may become very demanding in terms of time and computational effort. The attractiveness of Kriging for this effort stems from the fact that with a Kriging model, prediction variance may be calculated directly with relatively low computational effort, and the prediction variance is known to have a Gaussian distribution. If Kriging were to become excessively expensive, such as if a very large data set became necessary, cross-validation and an alternative surrogate modeling method should still allow this sampling algorithm to be applied with minimal modifications, albeit at what is expected to be significantly increased computational effort and time.

^1 Remember that Equation 9 is only valid for Z > 0, i.e., if $(T + \varepsilon) > m_n(x)$. If the response at point x is expected to be greater than $T + \varepsilon$, for example, and thus Z < 0, Equation 9 will produce nonsensical results. Whenever Z < 0, Z should be replaced by −Z to stay within the applicable range for Equation 9. This yields $\Phi(-Z)$ rather than $\Phi(Z)$, which can be easily accounted for by use of the identity $\Phi(Z) = 1 - \Phi(-Z)$.

6.5 Verification of Sampling Algorithm

Before any implementation of an algorithm may be applied, it should first be tested to verify that the algorithm as coded performs as intended. This serves to confirm that the implementation accurately reproduces the intended algorithm, and that the results produced by the implementation adequately reflect the performance of the algorithm. [8, 139] Once the sample-selection code is verified to be performing as intended, its performance could be assessed with greater confidence.

Toward this end, four verification experiments were conducted. The first three verification experiments would feature two input parameters and one output parameter, similar to a demonstration given by Picheny et al., [149] to assess the basic behavior of the algorithm and its subroutines. The final verification experiment would feature two input parameters and three output parameters to verify algorithm behavior for multiple responses.

Cart3D data would be used for the first and last verification experiments so that the algorithm could be demonstrated for a response representative of the intended application. To select the 2 free parameters from the 49 parameters used in the RBS effort, a sensitivity study was performed to quantify the relative effect of each free parameter excluding control deflections. This study indicated that for the three flight conditions analyzed, the wing root chord fraction and the fuselage radius fraction had the largest average impact on vehicle pitching moment.

Using these two parameters as independent variables, the first verification experiment would apply the contour-based sampling algorithm to one flight condition. This would allow the comparison of the behavior of the implemented algorithm against the behavior described by Picheny et al. Two other verification tests were developed based on standard test functions used in the optimization field due to the similarity between adaptive sampling algorithms and optimization algorithms.

The final verification experiment would assess the performance of the algorithm when applied to three flight conditions simultaneously. This experiment would investigate how the algorithm would behave when applied to a problem with multiple responses. This represents an extension beyond what has been described in the literature for this algorithm.

Before these experiments could commence, default values for the other 47 geometric parameters had to be selected. An iterative approach was used to identify a set of default values such that the two-dimensional design space would include configurations with small pitching moments at all three flight conditions. The values that were selected have been documented in Appendix B. Department of Defense High Performance Computing Center resources allowed this approach to be performed in a rapid, massively-parallel fashion, significantly reducing the time required.

6.6  First  Sampling  Verification  Experiment:  Two  Inputs,  One 
Response 

This  verification  experiment  served  to  confirm  that  the  algorithm  proposed  by  Picheny  et  al. 
had  been  implemented  correctly.  It  was  expected  that  the  test  problem  was  simple  enough  that 
any aberrant behavior could be identified relatively easily. The Mach 2.5, α 0° flight condition
was  chosen  for  this  test  because  at  this  condition,  the  test  problem  exhibits  fairly  simple 
response  behavior,  as  seen  in  Figure  9. 

A large fraction of the design space produced pitching moment coefficients (Cm) within the region of interest, i.e., |Cm| < 0.1, with the exception of cases with a large fuselage radius fraction (Radf) and a small root chord fraction (RC). Such cases exhibited a pitching
moment  that  was  more  negative  than  desired.  It  was  expected  that,  once  this  trend  was 
identified,  the  algorithm  would  avoid  placing  samples  in  the  undesirable  region  and  instead 
emphasize  the  region  where  the  pitching  moment  is  close  to  zero. 

6.6.1  Contributing  Analyses:  Prediction  Variance 

Before verifying the overall behavior of the algorithm, a number of intermediate checks were performed to build confidence in the underlying calculations. Because the algorithm depends
on  the  accuracy  of  the  estimated  variances,  the  prediction  variance  values  produced  by  the 
implemented  algorithm  were  evaluated  first. 


Figure 9: Pitching Moment Coefficient at Mach 2.5, α 0°

From  a  conceptual  standpoint,  the  prediction  variance  was  expected  to  be  small  near  existing 
samples  where  the  response  value  was  known  exactly  and  grow  larger  for  points  farther  from 
the  existing  samples.  [182]  Applying  these  calculations  to  a  two-dimensional  problem  allowed 
the  output  to  be  evaluated  visually.  This  evaluation  would  serve  as  the  first  check  on  the 
calculated  values:  if  this  behavior  was  not  observed,  there  was  likely  a  problem  with  the 
implementation.

The DACE toolbox by Lophaven et al.[107] for Matlab[123] has become a popular utility for the creation of Kriging models.[54, 62, 187, 203] In addition, the toolbox offers the option to
estimate  prediction  variance  at  any  point.  These  estimates  would  be  compared  to  those 
produced  by  the  algorithm;  this  comparison  would  serve  as  the  second  check  on  the 
calculated  values.  If  good  agreement  was  found,  this  would  be  taken  as  a  sign  that  the 
variance  estimation  portion  of  the  algorithm  was  working  as  intended. 


A 50 × 50 grid of configurations was generated, spanning the ranges of the wing root chord fraction and the fuselage radius fraction, and these configurations were analyzed with Cart3D at Mach 2.5, α 0°. A five-point Latin hypercube was generated and the pitching moment for each point was estimated by interpolating the grid results. These cases and the respective pitching moment for each were then used to train a Kriging model via the DACE toolbox. That Kriging model was the source of the baseline inverse covariance matrix ($C^{-1}$ in Equation 7) which was then used to calculate the augmented covariance matrix if a particular candidate point were added to the model. The results of this test may be seen in Figure 10.

In  this  figure,  the  triangles  represent  the  five  points  used  to  create  the  Kriging  model.  In 
Figure 10b, the square in the lower left represents the candidate point that is being added to the
model  (assuming  that  the  response  value  at  that  point  does  not  significantly  affect  the 
estimated  model  weights). 

Note  that  in  both  Figure  10a  and  Figure  10b,  the  prediction  variance  is  smallest  in  the 
neighborhood  immediately  around  each  point.  Additionally,  the  prediction  variance  is  relatively 
low  in  the  center  of  the  space  where  all  samples  are  somewhat  nearby,  and  grows  to  large  values 
in the corners of the space, which are the farthest from the sample points.

Figure 10: Comparison of Prediction Variance Estimates Produced by (a) DACE and (b) the Implemented Algorithm


In Figure 10b, the new sample point clearly reduces the nearby variance values. The close
agreement  between  the  two  images  (both  visually  and  numerically)  as  well  as  with  the 
expected behavior indicates that this aspect of the algorithm is functioning correctly.

6.6.2 Contributing Analyses: Weighting Functions

After  prediction  variance,  the  next  calculations  to  be  verified  were  those  supporting  the 
weighting  function.  The  weighting  function  was  used  to  identify  regions  where  additional 
experiments  would  best  improve  the  Kriging  model;  errors  in  the  calculation  of  the  weighting 
function would be harmful to the accuracy and efficiency of the method.

The  weighting  function  can  be  thought  of  as  the  probability  that  the  value  of  the  response  at  a 
given  test  point  falls  within  the  region  of  interest.  Equation  8  stated  that  the  weighting  function 
is  equal  to  the  probability  that  the  response  will  be  less  than  the  upper  threshold,  minus  the 
probability  that  the  response  will  be  less  than  the  lower  threshold.  Using  Equations  9,  11,  and 
12,  those  probabilities  can  be  calculated  using  only  values  for  the  response  and  variance  at  that 
point,  both  of  which  can  be  calculated  by  the  Kriging  surrogate  model. 

Conceptually,  it  was  expected  that  the  weighting  function  might  exhibit  a  variety  of  behaviors.  If 
at  some  point  the  estimated  response  was  between  the  two  cutoff  values  and  the  prediction 
variance was small, the weighting function at that point should be close to 1, which would
indicate  that  there  was  a  very  high  likelihood  that  the  true  response  at  that  point  fell  within  the 
range  of  interest.  As  the  variance  grew  larger  (i.e.,  the  confidence  in  the  estimate  decreased) 
there  was  an  increasing  chance  that  the  actual  response  at  that  point  was  outside  the  range  of 
interest,  and  thus  the  weight  would  decrease.  Alternatively,  as  the  predicted  response  moved 
farther  from  the  region  of  interest  the  weight  would  decrease  as  it  became  less  likely  that  the 
actual  response  still  fell  within  the  region  of  interest. 

Figure 11a depicts the predicted responses throughout the space based on the initial space-filling
cases.  In  this  example,  the  Kriging  model  uses  a  linear  underlying  model.  Recall  from 
Equation  1  that  the  predicted  response  is  a  combination  of  the  underlying  model,  which 
captures  the  general  trends  of  the  response,  and  the  covariance  matrix  which  accounts  for 
deviations  from  that  trend.  In  this  case  there  are  relatively  few  samples,  and  the  response 
predictions  are  dominated  by  the  linear  model.  The  covariance  effects  may  be  seen  only  in  the 
close  vicinity  of  the  sample  points.  The  prediction  variance  for  this  example  may  be  seen  in 
Figure 10a. In essence, the variance was small near the samples and large near the edges of the
space. 

The calculated weights throughout the space are plotted in Figure 11b. The weights are primarily
driven  by  the  predicted  response:  high  weights  are  observed  where  the  response  is  expected  to 
fall  within  the  range  of  interest,  and  low  weights  where  the  response  is  far  from  the  range  of 
interest.  The  weight  value  smoothly  tapers  between  those  regions,  indicating  points  which  are 
unlikely  to  have  good  response  values,  but  with  enough  prediction  uncertainty  that  they 
cannot be ruled out entirely.

Figure 11: Examples of Predicted Cm and Weighting


It is critical to note that the weighting function depends on the estimated response and variance at each point. Comparing the actual response in Figure 9 and the predicted response in Figure 11, it is clear that with only five samples, the initial Kriging model achieves only a rough approximation of the response behavior. As a result, the algorithm may at first select sub-optimal candidates due to imperfect information. The model is re-generated after each sample, however, and will progressively improve itself as more information becomes available. Based on this qualitative assessment, the implementation of the weighting function appears to be functioning as intended.

6.6.3  Contributing  Analyses:  wIMSE  Calculation 

Having verified the performance of various components of the algorithm, the behavior of the algorithm as a whole could be assessed. The intent was to evaluate candidate points and identify the one which most reduced prediction uncertainty (i.e., variance) in regions where the response value was within the desired range. It had been qualitatively demonstrated that the algorithm correctly modeled how a new sample would affect nearby prediction variance. It had also been qualitatively demonstrated that the algorithm could assign proper weights to cases based on estimated response and variance values. It only remained to combine the two features into a single scoring metric.

Weighted Integrated Mean Squared Error, or wIMSE, was the metric used to rank candidates. The unweighted form, IMSE, was used by Kleijnen and Van Beers[97] as an approximation of the overall prediction uncertainty in the design space. Integrating the variance analytically would be difficult at best; Kleijnen and Van Beers use numerical integration, by way of adding up the variance at each point in a representative set of samples. The candidate point that most reduced the unweighted IMSE could be thought of as the point which best reduced overall prediction uncertainty.

The use of a weighting factor allowed certain regions to be emphasized and others downplayed. In this case, the weighting factor was large in regions with desirable response performance - for example, regions where the vehicle pitching moment was close to zero - and small in regions with poor response performance. Thus, the candidate with the smallest wIMSE value was the case expected to most reduce prediction variance for the region or regions of interest.

If the algorithm was functioning as intended, it should select candidates which promised the
largest  reduction  for  prediction  variance  in  regions  of  interest.  It  should,  therefore,  select 
candidates  in  regions  of  large  variance  and/or  in  regions  where  the  weighting  function,  as  a 
proxy  for  level  of  interest,  was  also  large. 

The  algorithm  was  presented  with  a  23  by  23  grid  of  candidates  and  a  40  by  40  grid  of  test  points. 
Pitching  moment  coefficient  values  at  each  of  those  points  were  calculated  by  interpolating  the 
50 by  50  grid  of  Cart3D  results.  Grid  sampling  could  quickly  become  uneconomical  for  large 
real-world  problems,[193]  but  for  the  problem  at  hand  it  offered  the  significant  advantage  of 
repeatability. 

Given  those  candidates  and  test  points,  the  variation  of  wIMSE  score  throughout  the  design 
space  could  be  calculated  and  depicted  visually.  Figure  12  shows  wIMSE  scores  for  the  grid  of 
candidates.  The  most  attractive  candidate  was  the  one  that  produced  the  largest  reduction  in 
weighted  variance,  which  corresponded  to  the  smallest  wIMSE  score.  In  Figure  12  this 
candidate  is  marked  as  a  white  solid  circle. 

Comparing  this  image  with  the  estimated  variance  values  in  Figure  10a  and  the  weights  in 
Figure 11b, a few observations can be made. The wIMSE values indicated samples in the top-right and bottom-left corners were attractive, which matched the relatively high prediction
variances  in  those  regions.  Candidates  close  to  existing  samples  were  rated  as  unattractive, 
which  agreed  with  the  observation  that  prediction  variance  in  those  regions  was  already  low; 
there  was  more  room  for  improvement  elsewhere.  Finally,  the  upper-left  corner  was  considered 
to  be  significantly  more  desirable  than  the  lower-right  corner  even  though  both  had  roughly 
equal  levels  of  variance.  This  showed  the  influence  of  the  weighting  factor,  which  favored 
candidates  in  the  upper-left  based  on  the  high  probability  that  those  cases  had  desirable 
response values.

Figure 12: wIMSE Demonstration

Qualitatively, the algorithm appeared to be functioning as expected. The final test in the set was
to  evaluate  the  performance  of  the  algorithm  quantitatively. 

6.6.4  Evaluation  of  Accuracy 

Given  the  initial  set  of  five  space-filling  data  points,  the  algorithm  was  used  to  select  fifteen 
adaptive  samples.  Kriging  surrogates  were  created  after  each  sample  selection  and  used  to 
predict the pitching moment at candidate and test points throughout the space, once again using the 23 × 23 grid of candidates and 40 × 40 grid of test points. In each round, after a candidate was selected as the next sample, the actual pitching moment for that point was determined by interpolating the 50 × 50 grid of Cart3D analyses.

The  distribution  of  the  samples  appears  in  Figure  13a  &  Figure  13b.  These  images  show  the 
initial  five  space-filling  cases  (triangles),  and  fifteen  samples  selected  by  contour-based  sampling 
(circles).  Figure  13b  denotes  the  order  of  the  samples  chosen,  laying  them  over  a  full  contour 
plot  of  the  estimated  response.  The  next  point  to  be  sampled  is  marked  by  a  filled  circle  in 
Figure  13a  and  by  the  point  labeled  16  in  Figure  13b. 

Based on the response values for the cases that had been sampled, the response behavior throughout the space was estimated. Figure 13a shows the estimated region of interest, based on the Kriging model, which is bounded by the smooth solid contour line marked “−0.1”. The actual region of interest is bounded by the somewhat more erratic dotted line. There is fairly good agreement between the two.

Figure 13: Distribution of Samples & Order of Sample Selection


The  next  notable  observation  was  the  overall  distribution  of  the  samples  selected  by  the 
algorithm.  Samples  were  clustered  tightly  together  in  the  region  of  interest.  The  lower-right  area, 
where cases are highly unlikely to be of interest, was untouched except for the initial space-filling cases. If the samples were evenly distributed throughout the design space, more samples
would  have  been  placed  in  this  region  in  the  lower-right.  Instead,  the  algorithm  correctly 
identified the region as being of low interest and did not place samples there.

Another observation deserving emphasis was the order of the samples. Figure 11 shows the initial prediction of response behavior; the lower-left corner was also expected to be of little
interest.  As  seen  in  Figure  13b,  the  exploratory  sampling  which  identified  this  region  as 
interesting  did  not  occur  until  the  eighth  or  ninth  round.  Prior  to  that  round,  the  algorithm 
determined  that  interior  samples  would  improve  prediction  variance  more  than  exploration  of 
the  edges  of  the  estimated  region  of  interest. 

This is reflected in Figure 14, which shows how the 2.5 percent and 97.5 percent prediction error quantiles for the model changed as each new sample was selected and incorporated. Error here was defined as $Y_{predicted} - Y_{actual}$, so that negative errors indicated too-low predictions and positive errors indicated too-high predictions. The 2.5 percent and 97.5 percent quantiles represented some of the most-negative and most-positive prediction errors made by the model. Note that the 2.5 percent and 97.5 percent quantiles encompassed 95 percent of the prediction errors. The average prediction error was between the two values.
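
The error metric can be sketched as follows. Error is taken as predicted minus actual, so that negative values indicate too-low predictions as stated in the text. The quantile rule (linear interpolation between order statistics) is an assumption, since the report does not say which rule was used, and the function names are hypothetical.

```python
def quantile(values, q):
    """Quantile of a list using linear interpolation between order statistics.

    q is a fraction between 0 and 1 (e.g., 0.025 for the 2.5 percent quantile).
    """
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def error_quantiles(predicted, actual):
    """Return the 2.5 percent and 97.5 percent prediction error quantiles.

    Error is defined as predicted minus actual at each test point.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    return quantile(errors, 0.025), quantile(errors, 0.975)
```

Restricting `predicted` and `actual` to the test points whose actual response lies within the range of interest gives the corresponding region-of-interest metric.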


Figure  14:  2.5%  &  97.5%  Prediction  Error  Quantiles  for  Adaptive  &  Space-Filling  Sampling:  Entire  Space 

For  comparison,  for  each  number  of  samples,  equivalent-sized  Latin  hypercubes  were 
generated  and  used  to  build  models.  To  minimize  the  chance  that  lucky  or  unlucky  sample 
distributions  might  skew  the  comparison,  many  hypercubes  were  generated  for  each  sample  size. 
The  number  of  hypercubes  was  increased  until  the  average  metric  value  displayed  relatively 
smooth  trend  behavior  rather  than  erratic  noisy  behavior.  To  produce  the  results  illustrated 
below, three hundred^2 Latin hypercubes were generated for each sample size.
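
A basic Latin hypercube generator is sketched below for illustration. The report does not describe the specific generator or any optimization criterion that was used, so this is a generic, unoptimized version on the unit hypercube; the function name and the optional `rng` argument are hypothetical.

```python
import random


def latin_hypercube(n_points, n_dims, rng=None):
    """Generate a random Latin hypercube sample on the unit hypercube.

    Each dimension is divided into n_points equal strata, and each stratum
    contains exactly one point at a random position inside it.
    """
    rng = rng or random.Random()
    columns = []
    for _ in range(n_dims):
        # One coordinate per stratum, then shuffle the pairing across dimensions
        column = [(k + rng.random()) / n_points for k in range(n_points)]
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_points)]
```

Repeating the call many times for a given sample size, building a model from each design, and averaging the resulting error metrics corresponds to the comparison procedure described above.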

Note that after the fifteenth sample (i.e., the tenth adaptive sample plus the five initial cases), the lower quantile for adaptive sampling improved markedly. This corresponded to predictions which were previously too low shifting closer to the correct value. Prior to the tenth sample being placed there, the response in that region was expected to be more negative than it truly was; after the tenth sample, the algorithm appeared to have a firm grasp on the interesting regions of the problem.

^2 The goal of hypercube sampling was to estimate the performance of an average space-filling sample design. Three hundred repetitions at each hypercube size gave fairly smooth average results without requiring excessive analysis time.

From Figure 14, the reader might conclude that the sampling algorithm and the average space-filling sample distribution are roughly evenly matched for this problem. Before drawing that conclusion, however, bear in mind that the main objective of this algorithm was to maximize prediction confidence and accuracy only for cases where the response value is within some range of interest. To this end, the prediction error quantiles were again calculated, but this time only for points where the response fell within the specified range of |Cm| ≤ 0.1. Those quantiles, calculated for both the sampling algorithm and the Latin hypercubes, are plotted in Figure 15.


Figure 15: 2.5% & 97.5% Prediction Error Quantiles for Adaptive & Space-Filling Sampling: Region of Interest Only

When only the cases of interest were evaluated, the sampling algorithm was found to be
significantly  more  accurate  than  the  average  Latin  hypercube.  After  the  tenth  sample,  the 
sampling  algorithm  had  successfully  identified  and  sampled  all  regions  of  interest  within  the 
design  space.  The  prediction  error  quantiles  then  became  tightly  grouped  around  the  line  of  zero 
error,  in  contrast  to  the  quantiles  for  space-filling  samples. 

Unlike  the  previous  image,  here  the  quantiles  for  the  average  hypercube-based  model  did  not 
bracket  zero,  but  rather  both  were  below  zero.  Because  the  average  error  commonly  lies 
between  the  two  quantiles,  this  result  suggested  a  bias  in  the  predictions  of  hypercube-based 
models:  the  responses  at  interesting  cases  were  being  consistently  under-predicted,  a  behavior 
that  was  not  oh-  served  in  the  adaptive  sampling  results. 

6.7  Second  Sampling  Verification  Experiment:  Perm  Function 

The Perm function is a test function used to evaluate the performance of optimization
algorithms.[196] This function can be generated with n dimensions, where n is any positive
integer; for this application only two dimensions were used. The equation for the function is:

f(x) = \sum_{k=1}^{n} \left[ \sum_{i=1}^{n} (i^k + P) \left( (x_i / i)^k - 1 \right) \right]^2

In this function, P can be varied, which affects how closely the local minima approximate the
global minimum. The function is evaluated over the range -n < x_i < n, and the global
minimum is f(x) = 0 at x_i = i, where i = 1, ..., n. For this experiment, P was set to 0.5 on the
recommendation of Hedar.[77] For this value of P, the function produces values from 0 to
slightly over 100.
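The stated value range can be checked with a short sketch. It assumes the standard form of the Perm function distributed with Hedar's test-function collection, in which the parameter called P here is usually written as beta; the function name `perm` and the 50 x 50 evaluation grid are illustrative choices rather than details taken from the original study.

```python
def perm(x, p=0.5):
    """Perm function in its assumed standard form; x is a sequence of length n."""
    n = len(x)
    total = 0.0
    for k in range(1, n + 1):
        inner = sum((i ** k + p) * ((x[i - 1] / i) ** k - 1.0)
                    for i in range(1, n + 1))
        total += inner ** 2
    return total


# Evaluate on a 50 x 50 grid over the range -n .. n with n = 2.
n, steps = 2, 50
axis = [-n + 2 * n * j / (steps - 1) for j in range(steps)]
values = [perm((x1, x2)) for x1 in axis for x2 in axis]

print(perm((1.0, 2.0)))          # global minimum at x_i = i
print(min(values), max(values))  # range of the response on the grid
```

Under these assumptions the largest value occurs at the corner where both variables take their lowest values, which agrees with the description of the response given for Figure 16a.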

The behavior of the function is plotted in Figure 16a. The function generally increases toward
the low end of each variable, and a hump is present in the center of the space. Due to the
interplay of these two behaviors, many of the points in the lower-left quadrant have similar
f(x) values. Selecting 60 ± 5 as the “range of interest” for this function corresponds to the
region plotted in Figure 16b. This area has nonlinear edges and is mildly concave.

The sampling algorithm was applied to this test problem for the stated range of interest. A
5-point Latin hypercube was used to initialize the algorithm, after which 15 samples were
selected. After each selection, a new Kriging model was trained and used to predict the
response values for the region shown in Figure 16b. The predictions were compared to the
actual response values for those points, and the prediction error and Root Mean Squared
Error (RMSE) for the predictions were calculated.[85]

Additionally, 1,000 Latin hypercubes were generated for each sample size (e.g., 6, 7, ...). A new
Kriging model was trained for each hypercube and the model was used to predict response values
for the cases of interest. The average prediction RMSE for each sample size was recorded. The
RMSE values produced by Latin hypercubes and contour-based sampling are compared in
Figure 17.
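The space-filling baseline described above can be sketched as follows. This is an illustration only: the nearest-neighbor predictor is a deliberately simple stand-in for the Kriging model, the response function `bowl` and the test grid are hypothetical, and all function names are assumptions introduced for this example.

```python
import math
import random


def latin_hypercube(n_points, n_dims, rng):
    """Random Latin hypercube on [0, 1]^n_dims: one point per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [tuple(col[j] for col in columns) for j in range(n_points)]


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in surrogate (NOT Kriging): return the response of the closest sample."""
    best = min(range(len(train_x)), key=lambda j: math.dist(train_x[j], x))
    return train_y[best]


def average_rmse(response, test_points, n_points, n_repeats, seed=0):
    """Average prediction RMSE over many random hypercubes of a given size."""
    rng = random.Random(seed)
    actual = [response(x) for x in test_points]
    total = 0.0
    for _ in range(n_repeats):
        xs = latin_hypercube(n_points, len(test_points[0]), rng)
        ys = [response(x) for x in xs]
        predicted = [nearest_neighbor_predict(xs, ys, x) for x in test_points]
        total += rmse(predicted, actual)
    return total / n_repeats


# Hypothetical response and test points, used only to exercise the functions.
bowl = lambda x: (x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2
test_grid = [(i / 9, j / 9) for i in range(10) for j in range(10)]
for size in (6, 7, 8):
    print(size, average_rmse(bowl, test_grid, size, n_repeats=200))
```

Averaging over many hypercubes removes the luck of any single design, so the resulting curve represents a typical space-filling sample rather than a best or worst case.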
Figure 17: Comparing Predictive Accuracy for Perm Function Region of Interest

The model that used contour-based sampling started out with worse performance, and at first it
did not improve very rapidly compared to space-filling samples. After 15 samples, however, the
model based on adaptive sampling had identified the region of interest and had reduced its
prediction error by an order of magnitude relative to the average hypercube performance: an RMSE
of 0.160 versus 1.86 for hypercubes.

6.8  Third  Sampling  Verification  Experiment:  Sphere  Function 

The Sphere function is another test function from the field of numerical optimization.[146] It is
a fairly simple function:

f(x) = \sum_{i=1}^{n} x_i^2    (16)

This function can also take any number of dimensions; once again two dimensions will be used
for the sake of simplicity. The search domain for this function was set to -5.12 < x_i < 5.12
based on the recommendation of Hedar.[78] This function has a global minimum of 0 when
x_i = 0 for all i.

The behavior of the function is shown in Figure 18a.

Figure 18: Sphere Function & Region of Interest

It was decided to make this evaluation more challenging than the previous tests. Rather than
choosing a convex region centered about the minimum value, the range of interest was defined to
be 5 < f(x) < 15. Cases which met this criterion are marked with small circular icons in
Figure 18b. These cases were spread across a large portion of the design space and the
grouping was non-convex. Furthermore, the linear underlying trend of the Kriging models
was not a good match for this function, which meant that the estimated response values used
for the adaptive sampling algorithm might be significantly inaccurate. Once again, the predictive
accuracy of the model which used contour-based sampling was compared against the average
prediction RMSE of 1,000 Latin hypercubes at each sample size. The results of this comparison
can be seen in Figure 19.
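The shape of this region of interest can be illustrated with a small sketch. The 50 x 50 grid resolution is borrowed from the grids used elsewhere in this chapter and is an assumption here, as are the variable names.

```python
def sphere(x):
    """Sphere test function: sum of squared coordinates."""
    return sum(xi ** 2 for xi in x)


steps = 50
lo, hi = -5.12, 5.12
axis = [lo + (hi - lo) * j / (steps - 1) for j in range(steps)]
grid = [(x1, x2) for x1 in axis for x2 in axis]

# Cases of interest form a ring around the minimum rather than a disc centered on it.
of_interest = [x for x in grid if 5.0 < sphere(x) < 15.0]
print(len(of_interest) / len(grid))  # fraction of the design space that is of interest

# Non-convexity check: the midpoint of two cases of interest can fall outside the region.
a, b = of_interest[0], of_interest[-1]
midpoint = tuple((ai + bi) / 2 for ai, bi in zip(a, b))
print(sphere(midpoint))
```

Because the two chosen cases sit on opposite sides of the ring, their midpoint lies near the global minimum, where the response is well below the lower bound of the range of interest.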


Figure  19:  Comparing  Predictive  Accuracy  for  Sphere  Function  Region  of  Interest 

Models using the space-filling samples once again showed smooth improvement, and after 15
samples reached diminishing returns as the response behavior was fairly well understood.
The model based on adaptive sampling, on the other hand, could at best be said to struggle
with this problem. The first four samples produced significant decreases in predictive
accuracy. Subsequent samples showed a gradual improvement at a rate slower than the rate
exhibited by hypercube sampling. Only in the final 2 samples did the model based on adaptive
sampling identify the true form of the response and react accordingly.

The large vertical scale required to include all the data points precluded a detailed visual
comparison of the two methods at the end of sampling. It was found that after 20 samples, the
average space-filling model could claim a prediction RMSE of 0.109 while the model based on
adaptive sampling produced an RMSE of 0.151. Although the adaptive model showed
substantial improvement, that does not excuse the poor performance which was demonstrated
for the bulk of the experiment.

The contour-based sampling algorithm was shown to perform well for problems which have
simple regions of interest, but the algorithm could be applied to other problems with some
accompanying loss of effectiveness. In addition, because a surrogate model is used when
evaluating candidate points, the accuracy of that surrogate model can significantly affect the
effectiveness and efficiency of the sample selection process. Even when the Kriging model is a
poor representation of the response behavior, continued sampling can eventually correct this
shortcoming. Unfortunately, there was no obvious way to know how many more samples would
be required before the algorithm would correct itself.

These experiments served to verify that the sample selection algorithm performed as expected.
Encouragingly, the algorithm had in many cases been shown to provide better predictive
capability for cases of interest compared to hypercube-based results. On the strength of these
results, the algorithm was extended to handle multiple responses simultaneously.

6.9 Fourth Sampling Verification Experiment: Two Inputs, Three Responses

The evidence so far, both as part of this effort and in the literature, had only demonstrated
contour-based sampling for a single response. With regard to the problem at hand, the
aerodynamic moments on vehicles must be controllable at all expected flight conditions, not
one. The algorithm would thus be modified to identify cases which were beneficial for
multiple responses (i.e., cases which improved prediction accuracy for small pitching moments
at multiple flight conditions) simultaneously.

This was accomplished by creating an overall sampling criterion which incorporated the
performance of a given candidate at all flight conditions. For each candidate, the wIMSE
score is calculated at every flight condition as detailed above. Once all wIMSE scores have been
calculated, the scores for each flight condition are normalized (as given in Equation 13) and
then the average normalized wIMSE score is calculated for each candidate. The combined
score shall henceforth be referred to as joint wIMSE. This synthesizes the wIMSE information
and allows easy ranking of candidates. Before this approach can be adopted, however, it must
be demonstrated as effective.
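The joint wIMSE calculation can be sketched as below. Equation 13 is not reproduced in this section, so the sketch assumes a min-max normalization to the range 0 to 1 as a stand-in; the function names and the numerical scores are hypothetical, and lower scores are assumed to be preferable.

```python
def normalize(scores):
    """Min-max scaling to [0, 1]; an ASSUMED stand-in for the document's Equation 13."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def joint_wimse(scores_by_condition):
    """scores_by_condition[c][k] is the wIMSE of candidate k at flight condition c."""
    normalized = [normalize(scores) for scores in scores_by_condition]
    n_candidates = len(scores_by_condition[0])
    return [sum(cond[k] for cond in normalized) / len(normalized)
            for k in range(n_candidates)]


# Hypothetical wIMSE scores: three flight conditions (rows), three candidates (columns).
scores = [
    [4.0, 2.0, 6.0],
    [0.30, 0.10, 0.20],
    [50.0, 10.0, 30.0],
]
joint = joint_wimse(scores)
best = min(range(len(joint)), key=joint.__getitem__)  # lower score assumed better
print(joint, best)
```

Normalizing before averaging keeps a flight condition with large raw wIMSE values from dominating the combined score.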

To minimize data generation requirements, the data pool from the first sampling verification
experiment was retained. Additional data was generated for the same design space at the other
two flight conditions, Mach 0.3, α 15°, β 0° and Mach 0.8, α 0°, β 0°. The response behavior
for these two flight conditions may be seen in Figure 20. Note the sharp variations observed
at Mach 0.8. Although the pitching moment was converged for all cases or interpolated from
nearby converged results, these sharp variations in response were still observed. At the present
time it is believed that they result from the transonic flight condition, a condition which the
Euler CFD tool may be hard pressed to model.


Figure 20: Pitching Moment Coefficient at Mach 0.3, α 15° and Mach 0.8, α 0°


Using the 50 × 50 grid of samples, cases of interest for each flight condition are highlighted in
Figures 21a, 21b, and 21c. Figure 21d illustrates the cases which have |Cm| < 0.1 at all three
flight conditions simultaneously. The jaggedness of the transonic response behavior could be
inferred from the erratic distribution of cases of interest in Figure 21b. The Mach 0.3 flight
condition was clearly the most restrictive, although there were cases which produced an
acceptable Cm at Mach 0.3 but an overly negative Cm at Mach 2.5.


Convergence  for  these  efforts  was  defined  as  a  standard  deviation  of  less  than  0.05  and  less 
than  5%  of  the  average  response  value  when  evaluated  over  the  final  twenty  iterations  of  the 
flow  solver. 
Figure 21: Cases of Interest at Each Flight Condition and At All Flight Conditions

Once the algorithm was modified to evaluate all three flight conditions, sampling began. The
same five initial space-filling samples were used for this experiment so that any deviations from
the previous sample selections may be attributed to the inclusion of additional responses. The
resulting distribution of samples is shown in Figure 22a. Comparing the samples selected against
the regions of interest in Figure 21 suggests that instead of reducing prediction variance where
all responses fall within the ranges of interest, the algorithm selected cases where any
response fell within the range of interest. Put another way, although sample 2 was not expected
to fall in or near the region of interest for Mach 0.3 or 0.8, it was of great interest for the Mach
2.5 model.

The resulting prediction accuracy for each response for cases of interest (i.e., the cases in
Figure 21d) is shown in Figures 22b, 22c, and 22d. Note that after the ninth and tenth samples
(i.e., the fourth and fifth adaptive samples after the five initial space-filling cases), the
prediction accuracy for each response improved dramatically. The fifth adaptive sample was very
close to the region of interest, resulting in drastic reductions in prediction error.
Figure 22: Initial Sampling (a) and 95 Percent Prediction Error Quantiles for Cases of Interest at
(b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5

The distribution of samples was still much more scattered than might have been expected given
the relatively few cases which were of interest for all responses. If the purpose of the sampling
were to identify cases of interest for any response, this behavior would be acceptable. For the
purpose of effective vehicle design, however, the algorithm must emphasize the intersection of
the regions of interest rather than the union. A research engineer[49] suggested a strategy of
excluding candidate points based on a probability of interest score.

At  a  particular  candidate  point,  the  probability  that  the  response  value  would  fall  within  the 
range  of  interest  could  be  calculated;  this  was  already  done  for  test  points  as  part  of  the 
weighting  function,  but  was  not  necessary  for  candidate  points  for  the  one-response  problem. 
Calculating  this  probability  for  candidate  points  provided  an  estimation  of  how  likely  the 
candidate  was  to  fall  within  the  region  of  interest  for  each  response.  Once  this  probability  had 
been  calculated  for  all  responses,  the  minimum  value  was  referred  to  as  the  “Probability  of 
Interest”  or  POI  for  that  candidate. 
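A minimal sketch of the POI calculation follows. It assumes that each Kriging prediction is treated as a normal distribution described by a predicted mean and standard deviation, which is a common interpretation but is not spelled out in this section; the function names and the numerical predictions are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_in_range(mean, std, lower, upper):
    """P(lower <= Y <= upper) for Y ~ Normal(mean, std^2)."""
    if std <= 0.0:
        return 1.0 if lower <= mean <= upper else 0.0
    return normal_cdf((upper - mean) / std) - normal_cdf((lower - mean) / std)


def probability_of_interest(predictions, lower, upper):
    """predictions holds one (mean, std) pair per response; POI is the minimum probability."""
    return min(probability_in_range(m, s, lower, upper) for m, s in predictions)


# Hypothetical predictions of Cm for one candidate at three flight conditions.
candidate = [(0.02, 0.05), (-0.08, 0.10), (0.30, 0.20)]
print(probability_of_interest(candidate, -0.1, 0.1))
```

Taking the minimum means a candidate scores well only if it is likely to be of interest for every response, which is what emphasizes the intersection of the regions of interest rather than their union.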
The user may then specify a required POI value before the algorithm begins. Any candidate
point with a POI less than the required value will be ignored and assigned a wIMSE score equal
to the highest observed value. This assigned wIMSE score designates such points as
uninteresting to the algorithm, ensuring that those points will not be selected for sampling in
that round.

Care must be taken when specifying a required POI value. Although POI calculations are based
on the best available Kriging surrogates at the time, those surrogates may not accurately capture
the true response behavior, especially when very few samples are available. The estimated
response in Figure 11 has only passing similarities to the actual response in Figure 9.
Furthermore, this sampling approach is intended for use when resources will only allow a
limited number of analyses, increasing the risk that the surrogate models may be inaccurate in
the early stages of sampling.

It is tempting to demand a high POI value for each sample to be sure that every point falls in
or near the region of interest. As the number of responses grows or the regions of interest
shrink (compare the region of interest at Mach 2.5 in Figure 21c to that at Mach 0.3 in Figure
21a), it becomes less likely that any candidate will meet the requirements, especially for
large design spaces. For this verification problem, a POI requirement of 25 percent is enough to
disqualify every point in the 23 × 23 grid of candidates.

If  no  candidate  meets  the  required  POI,  the  algorithm  will  select  the  candidate  with  the 
maximum  POI,  i.e.,  the  candidate  that  is  the  most  likely  to  be  in  a  region  of  interest  for  every 
response.  This  is  not  necessarily  a  bad  result,  given  that  the  point  is  likely  to  be  in  a  promising 
location,  but  if  no  candidates  meet  the  required  POI  then  their  effects  on  prediction  variance  are 
not  considered.  As  a  limiting  case,  a  candidate  might  fall  very  close  to  a  previous  sample 
which  was  in  the  region  of  interest.  This  candidate  will  probably  have  a  very  high  POI,  but 
may  be  so  close  that  it  will  not  provide  much  new  information  about  response  behavior. 
Another  candidate,  farther  away,  might  have  a  lower  POI  but  would  provide  a  much  larger 
reduction  in  prediction  variance. 
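The selection rule, including the fallback used when no candidate meets the requirement, can be sketched as follows. The function name and the numerical scores are hypothetical, and the sketch assumes that a lower wIMSE score marks a more desirable candidate.

```python
def select_candidate(wimse, poi, required_poi):
    """Return the index of the candidate to sample next.

    wimse[k]: (joint) wIMSE score of candidate k, lower assumed better
    poi[k]:   probability of interest of candidate k
    """
    eligible = [k for k in range(len(wimse)) if poi[k] >= required_poi]
    if not eligible:
        # Fallback: no candidate qualifies, so take the one with the maximum POI.
        # Its effect on prediction variance plays no role in this branch.
        return max(range(len(poi)), key=poi.__getitem__)
    # Candidates below the requirement receive the worst observed score.
    worst = max(wimse)
    penalized = [w if p >= required_poi else worst for w, p in zip(wimse, poi)]
    return min(eligible, key=penalized.__getitem__)


wimse = [0.2, 0.5, 0.9]     # hypothetical scores for three candidates
poi = [0.02, 0.30, 0.60]
print(select_candidate(wimse, poi, 0.01))  # tolerant requirement: variance reduction decides
print(select_candidate(wimse, poi, 0.25))  # stricter requirement: candidate 0 is excluded
print(select_candidate(wimse, poi, 0.90))  # nothing qualifies: highest POI is chosen
```

The third call shows the limiting behavior discussed above: once every candidate is disqualified, the choice is driven by POI alone.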

This was demonstrated with a side experiment, in which the POI requirement was set to 25
percent. Results are depicted in Figure 23. For the first adaptive sample selection, the
candidate with the highest POI was the one closest to the upper-left space-filling point. This
candidate had a high POI value because that training point was the closest to having good
performance in all three responses, and because prediction variance was relatively low in the
region around the training point. The candidate point which was selected, being extremely close
to an existing sample, did not provide much new information about response behavior, which
made it difficult for the algorithm to learn from its mistake. This is shown clearly in Figure 23a,
where the algorithm primarily selected samples that were quite close to the existing data set.

As a result, the prediction error for cases of interest improved very gradually. The average
space-filling set of samples produced a surrogate which was approximately as good as
the surrogate trained on contour-based samples, with the exception of the Mach 2.5 case
(Figure 23d) where contour-based sampling still offered an improvement. In fact, the model
based on contour-based sampling with a POI requirement of 25 percent in this scenario was less
accurate than when no POI requirements were used at all (Figure 22). A more tolerant POI
setting would allow the algorithm to spread its samples out, and as a result the accuracy of the
Kriging models would improve much more rapidly.


Figure 23: Sampling For Required POI of 25% (a) & 95% Prediction Error Quantiles for Cases of Interest
at (b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5

The demonstration was repeated with a small POI requirement (1 percent), as shown in Figure
24. This time, most of the samples were clustered in the region of interest, with occasional
forays into other regions that might be promising. The prediction accuracy for all three
responses improved more rapidly than in the other two cases shown.

In  general,  a  low  POI  requirement  would  favor  the  selection  of  candidates  which  would  have  a 
large  effect  on  prediction  variance,  but  might  not  lie  in  a  region  of  interest  for  all  responses.  A 
high  POI  requirement  would  encourage  the  selection  of  cases  that  were  more  likely  to  lie  in  the 
region  of  interest,  but  might  be  less  beneficial  from  the  perspective  of  variance  reduction. 
Essentially,  the  POI  requirement  can  be  thought  of  as  a  means  for  the  user  to  express  a 
preference  between  exploration  and  exploitation. 
Figure 24: Sampling For Required POI of 1% (a) & 95% Prediction Error Quantiles for Cases of Interest
at (b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5

Other POI requirement values were examined for this problem as well. The results illustrated the
effect of the POI requirement in greater detail. The three examples given here should be
sufficient to illustrate how the parameter affects the behavior of the algorithm; the other results
may be found in Appendix C.

6.9.1  Evaluation  of  Accuracy 

The  results  of  these  verification  experiments  indicated  that  the  sample-selection  algorithm  was 
functioning  properly.  Based  on  the  available  information,  it  estimated  the  predicted  responses 
and  variances  throughout  the  input  space  and  selected  samples  which  would  reduce  prediction 
variance  in  specified  regions  of  interest.  This  reduction  in  prediction  variance  corresponded  to 
increased  prediction  confidence  and  accuracy  in  those  regions. 

For problems with multiple responses, the Probability of Interest parameter was introduced.
This parameter allows the behavior of the algorithm to be tuned according to the user’s
tolerance for samples which may not be of interest for all responses. Using this parameter,
the algorithm was demonstrated to effectively and efficiently select samples which greatly
improved prediction accuracy in the region of interest compared to space-filling sampling.

The  demonstrations  thus  far  have  been  done  for  a  two-dimensional  test  problem.  Probable 
applications  for  this  method  in  the  field  of  vehicle  design  will  likely  feature  many  more  free 
parameters.  The  final  test  of  the  sampling  algorithm,  and  by  extension  the  final  test  of 
Hypothesis  1,  would  feature  multiple  responses  as  well  as  a  larger  number  of  free  parameters. 

6.10 Larger Multi-Response Experiment: Nine Inputs, Three Responses

6.10.1 Selection of Additional Free Parameters Using Sensitivity Analysis

The  free  parameters  for  this  experiment  were  selected  using  a  series  of  sensitivity  studies 
based  on  the  nested  Latin  hypercube  cases  that  were  analyzed  in  the  original  Reusable  Booster 
System  effort  by  ASDL  and  AFRL.  The  data  sets  analyzed  included  at  least  eleven  thousand 
space-filling  samples  at  each  flight  condition.  The  sensitivity  studies  were  performed  in  JMP,  a 
statistical  analysis  program  by  SAS  Software.  [91]  Each  study  attempted  to  identify  the  effects 
of  the  forty-nine  input  parameters  on  the  behavior  of  the  vehicle  pitching  moment. 

The results of the sensitivity study were expressed as fractional contributions: the portion of the
observed variation in the response that was expected to be due to variation of each input
parameter. These fractions were then averaged together to identify the parameters that had the
most influence on pitching moment for these flight conditions. The top ten parameters and the
average sensitivity are listed in Table 3.

Scale was a parameter that set the overall size of the vehicle. Its presence in this list was
unexpected because the pitching moment coefficient had already been normalized by the
reference area. A change in the Scale parameter was equivalent to a photographic scaling of
the vehicle, which should not affect the pitching moment coefficient.

The  magnitude  of  the  sensitivity  results  held  the  solution.  The  parameters  could  be  grouped 
according  to  sensitivity  values:  parameters  2  &  3,  parameters  4  &  5  and  parameters  6  through 
10.  Many  other  parameters  (not  shown)  had  values  close  to  3  percent.  It  was  inferred  that, 
because  most  of  the  parameters  had  approximately  the  same  effect  on  the  response  variation,  the 
sensitivity  test  was  not  able  to  clearly  differentiate  between  those  parameters  given  the  data  set 
used.  A  larger  data  set  would  have  been  able  to  better  distinguish  sensitivity  effects,  but  further 
data  was  not  available. 

In  light  of  prior  knowledge  of  how  vehicle  scale  affects  the  response,  it  was  decided  after 
consultation  with  AFRL  to  omit  the  Scale  parameter  and  perform  the  sensitivity  study  using  the 
other  nine  parameters  listed  in  Table  3.  The  ranges  for  the  seven  new  variables  which  will  be 
included  are  given  in  Appendix  B. 
Table 3: Partial Sensitivity Study Results

Parameter Name                         Sensitivity Rank    Fractional Contribution
Root Chord Fraction                    1                   15.82%
Fuselage Radius Fraction               2                   9.69%
Nose Droop                             3                   8.53%
Nose Fineness Ratio                    4                   6.31%
Wing Airfoil Maximum Camber            5                   6.19%
Wing Half-Span Fraction                6                   3.66%
Scale                                  7                   3.62%
Area Ratio of Vertical Tail to Wing    8                   3.47%
Top Curvature 1                        9                   3.44%
Top Curvature 2                        10                  2.99%

6.10.2  Infeasibility  of  Grid  Sampling  for  Test  Data 

Unlike the experiments with 2 free parameters, a grid search was not feasible over nine
parameters. For the two-parameter experiments, each variable was sliced into 50 equal spaces,
resulting in 2,500 cases for the full factorial analysis. Achieving the same resolution over nine
free parameters would require 1.95 × 10^15 analyses. Even a three-level full factorial sampling
would require 3^9, or nearly twenty thousand, samples, nearly eight times as many as were used
for the two-parameter experiments. Any grid search that could be executed in a feasible time
would necessarily have very low resolution. It was therefore decided that for this experiment, it
would be too computationally expensive to pre-run all the data that might be necessary. Instead,
each case would be analyzed as it was required.
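The growth in the number of grid cases can be reproduced with a few lines of arithmetic. The variable names are illustrative only.

```python
levels = 50          # grid levels per variable in the two-parameter experiments
cases_2d = levels ** 2
cases_9d_full = levels ** 9
cases_9d_three_level = 3 ** 9

print(cases_2d)                                  # 2500
print(f"{cases_9d_full:.3e}")                    # 1.953e+15
print(cases_9d_three_level)                      # 19683
print(f"{cases_9d_three_level / cases_2d:.2f}")  # 7.87
```

Even the coarsest useful grid therefore costs several times the full two-parameter data set, while providing only three levels per variable.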

As  before,  the  algorithm  was  intended  to  preferentially  sample  cases  in  the  region  of  interest, 
where  pitching  moments  were  close  to  zero  at  all  flight  conditions.  Because  the  default 
settings  for  the  seven  new  parameters  were  all  within  their  current  ranges,  it  was  known  that 
some  cases  of  interest  exist  within  the  nine-dimensional  space.  Sixteen  such  cases  were  already 
identified  for  the  two-dimensional  experiments,  all  tightly  clustered.  More  cases  of  interest 
were  desired,  however;  the  more  cases  of  interest  available,  the  better  the  prediction  accuracy  of 
various  surrogate  models  can  be  evaluated. 
6.10.3 Alternative Approach: Genetic Algorithms

At this stage of the research effort, time was considered more valuable than computing effort.
HPCC systems were available to analyze a large number of cases in parallel, and it had been
indicated that more resources could be allocated to the effort if necessary. Essentially, the
per-analysis cost on HPCC systems was negligible. The ability to run dozens or hundreds of cases
simultaneously allows a large population of cases to be evaluated in a timely fashion, making
genetic algorithms (GAs) an attractive option.

As Grefenstette states, “[w]hile classical gradient search techniques are more efficient for
problems which satisfy tight constraints, GAs consistently outperform both gradient
techniques and various forms of random search on more difficult (and more common) problems,
such as optimizations involving discontinuous, noisy, high-dimensional, and multimodal
objective functions.”[69] Although the response was expected to be continuous, the plot of Cm
behavior at Mach 0.8 in Figure 20 indicated that the pitching moment at that flight condition
may be noisy. The relatively complex behavior observed in the two-dimensional response at
Mach 0.3, α 15° (see again Figure 20) indicated that multi-modal behavior could not be ruled out.
The 9 active input parameters here did not unambiguously make the problem “high-dimensional,”
but this application could be considered a warm-up: if the genetic algorithm method performed
well in this case, the same method would likely be used to identify test cases for the full-scale
problem which has 49 dimensions. In light of these observations, a genetic-algorithm-based
approach seemed plausible.

Although  GA  optimization  can  be  powerful,  the  method  has  its  share  of  drawbacks.  One  of  the 
negative  aspects  of  the  genetic  algorithm  approach,  the  inability  to  guarantee  that  a  true  optimum 
has  been  found,  was  immaterial  in  this  case:  it  need  only  find  cases  with  small  objective 
functions,  not  a  global  minimum.  Another  negative  is  that  the  technique  often  requires  a  high 
number  of  function  calls  relative  to  other  optimization  techniques.  This  was  of  no  consequence 
when  the  per-analysis  cost  was  considered  minimal  in  light  of  HPCC  resources.  The 
drawbacks  of  genetic  algorithms  were  relatively  minor  for  this  application.  A  review  of  the 
method  is  therefore  appropriate. 

6.10.4  Review  of  Genetic  Algorithms 

There are many different approaches under the umbrella of Genetic Algorithm optimization,
and spatial limitations prevent them all from being discussed in depth in this document. For a
more thorough review, see chapter six of Holland’s Adaptation in Natural and Artificial
Systems.[81]

The  typical  GA  formulation  will  discretize  each  variable  into  a  binary  string.  The  number  of  bits 
in  the  string  describing  that  variable  will  determine  the  resolution  for  that  variable.  A  variable 
represented  by  one  bit  has  two  possible  values  (typically  the  minimum  and  maximum  values 
allowed);  two  bits  allows  the  encoding  of  four  values,  and  so  on.  The  binary  strings  for  each 
variable  were  concatenated  to  produce  a  single  binary  string,  called  a  “chromosome,”  that 
described  a  particular  case. 
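The binary encoding described above can be sketched as follows. The function names, the equal bit count for every variable, and the example bounds are assumptions made for illustration.

```python
def encode(value, lower, upper, n_bits):
    """Map a value in [lower, upper] to the nearest of 2**n_bits evenly spaced levels."""
    levels = 2 ** n_bits - 1
    index = round((value - lower) / (upper - lower) * levels)
    return format(index, f"0{n_bits}b")


def decode(bits, lower, upper):
    """Recover the variable value represented by a bit string."""
    levels = 2 ** len(bits) - 1
    return lower + (upper - lower) * int(bits, 2) / levels


def encode_case(values, bounds, n_bits):
    """Concatenate the per-variable bit strings into a single chromosome."""
    return "".join(encode(v, lo, hi, n_bits) for v, (lo, hi) in zip(values, bounds))


# Two hypothetical variables, each encoded with two bits (four possible values).
chromosome = encode_case([0.0, 3.0], [(0.0, 1.0), (0.0, 3.0)], n_bits=2)
print(chromosome)                # "0011"
print(decode("01", 0.0, 3.0))    # 1.0
```

With one bit per variable only the minimum and maximum are representable; each additional bit doubles the number of levels and therefore roughly halves the spacing between them.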

Most GAs apply two operations to the population in order to generate the (n + 1)th
population: reproduction and mutation. During reproduction, one or more “parent”
population members are selected and their objective function scores, or fitness values,
evaluated. The member with the better score continues on to the next step, which may involve
direct descent, genetic crossover, or a combination of the two. Direct descent adds the member
to the (n + 1)th population as a “child” without any changes, which ensures that good-scoring
cases will remain as “breeding stock” in the next round. Genetic crossover, on the other hand,
usually takes two “parent” members and swaps a portion of the “genes” which define each
case. In most applications, crossover is a probable but not certain occurrence. The user
determines the likelihood that crossover will occur, with typical values in the range of 50-90
percent depending on the population size.[69]

Mutation is the final stage before a “child” becomes a member of the (n+1)th population. If
mutation occurs, a single bit in the child’s chromosome is selected and flipped. If the bit had
been a zero, it becomes a one, and vice versa. The mutation rate is also selected by the user. As
Raczynski describes it, mutation is applied “to introduce traits into a population that
otherwise would not exist... [s]ince reproduction only produces offspring that are based on the
parents, if a certain value in the chromosome is not found anywhere in the parent population,
then it will not be anywhere in the offspring.”[158] A larger mutation rate will increase the
genetic diversity of the population at the cost of a reduced convergence rate due to the
randomness that is introduced. Common mutation rates are in the neighborhood of 3-5
percent.[69]

Before  a  GA-based  approach  could  be  applied,  however,  the  objective  function  by  which 
population  members  would  be  ranked  had  to  be  specified. 

6.10.5  Defining  the  Objective  of  the  Genetic  Algorithm 

When multiple responses are important, an overall objective function may be constructed, such as
by  a  weighted  sum. [29]  This  overall  objective  function  allows  each  population  member  to  be 
described  by  a  single  “fitness”  score  that  quantifies  its  desirability.  In  this  case,  a  simple 
additive  objective  function  was  constructed: 


$$\mathrm{ObjFunc} = \sum_{i=1}^{N} w_i \, |C_{m,i}| \tag{17}$$

Here, $w_i$ is the individual weighting function for flight condition $i$, and $|C_{m,i}|$ is the absolute
value of the pitching moment coefficient at flight condition $i$. $N$ is the total number of flight
conditions being considered.

The goal was to promote cases with $|C_{m,i}| < 0.1$; once this was achieved it was not important to
drive the pitching moment at that flight condition to be smaller. Toward this end, a conditional
weighting was applied:


$$w_i = \begin{cases} 10 & \text{if } |C_{m,i}| > 0.1 \\ 1 & \text{otherwise} \end{cases} \tag{18}$$


As $|C_{m,i}|$ decreases, this weighting function results in large objective function reductions (i.e.,
improvements). This holds true until $|C_{m,i}|$ falls below the threshold of interest (0.1), after
which any further reduction in that pitching moment produces only a fraction of the previous
rewards. The weighting function was applied equally to every flight condition.
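As a minimal sketch of Equations 17 and 18, the weighted objective can be written as follows. The function and parameter names are illustrative assumptions; only the threshold of 0.1 and the weights of 10 and 1 come from the text.

```python
def weight(cm, threshold=0.1, penalty=10.0):
    """Conditional weight of Equation 18 for one pitching moment coefficient."""
    return penalty if abs(cm) > threshold else 1.0


def objective(cms, threshold=0.1, penalty=10.0):
    """Weighted sum of Equation 17 over all flight conditions (lower is better)."""
    return sum(weight(cm, threshold, penalty) * abs(cm) for cm in cms)
```

For example, a case with pitching moments of 0.05, -0.2, and 0.1 at three flight conditions scores 0.05 + 2.0 + 0.1 = 2.15, so the single out-of-range condition dominates the score.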

6.10.6  Tailoring  a  Genetic  Algorithm  for  the  Current  Application 

A custom GA was synthesized for this application. Most GAs retain a “memory” of only one or
two populations. It is expected that any desirable population members will be retained in the
active population, and indeed direct descent explicitly tries to make sure that the best population
members appear in the next population. For this application, it was decided that it was not
worthwhile to spend time re-evaluating a case that had previously been analyzed. Instead, a
history of observed fitness scores was maintained.

Each time a population was analyzed, its members and their associated objective function scores
would be added to this population history. This population history would then be sorted by
objective function, and the five hundred cases with the best objective scores would be pulled
out as “breeding stock.” These cases would then be subjected to crossover and mutation to
produce the next population for analysis.

For every CFD analysis, there was a small but non-zero risk that the analysis would fail to
produce a result. The geometric utility might not build the outer mold line properly, the flow
solver might not converge adequately, or it might converge to a nonsensical result. Efforts
were made to mitigate these risks, as described in Appendix B, but the chance of a failed case
could not be entirely eliminated. This risk was compounded by the fact that each case was
simulated for multiple flight conditions. If a case failed to converge to a plausible result at any
flight condition, that case was omitted from the list of results. To avoid re-running such a case,
a list of attempted cases was maintained, separate from the list of finished cases.

Thus, the GA implemented for this application proceeded as follows: First, the total set of
observed results was sorted based on objective function value, and the best five hundred results
were used as “breeding stock.” Two members from this set were selected and their objective
functions compared. The member with the better objective score was retained. A random
number was generated on the range from 0 to 1, and if the value was less than the likelihood
of crossover (70 percent), crossover occurred.

When crossover occurred, another population member was selected from the population, and
“starting” and “stopping” locations on the chromosome were picked via random number
generation. The genes between these two indices on parent A were transposed to the same
locations on parent B, while the same genes on parent B were moved to parent A. This
produced two new cases, each of which continued the process.

Whether or not crossover occurred, there was a chance for each case to experience mutation.
For this application, a wide variety of results was more desirable than a rapid convergence of
the method; additionally, although excessive mutation might in some applications introduce so
much randomness that progress is impossible, the retention of every observed result mitigated
this concern. A relatively large mutation rate of 10 percent was therefore used. A random
number was generated on the range from 0 to 1 and compared to the mutation rate. If the random
number was less than the specified mutation rate, mutation occurred and one random bit in the
chromosome was flipped.

After the mutation operator, the case was finished and ready to enter the new population. Before
this happened, it was compared against the list of cases that had been attempted previously, as
well as the cases that were already in the new population. If the case matched a pre-existing one,
it was discarded; if not, it was added to the new population.
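The procedure described above might be sketched as follows. This is an illustrative reconstruction rather than the code used in the study: the function name, argument names, and the `max_tries` guard are assumptions, chromosomes are represented as bit strings, and the crossover mate is assumed to be drawn from the breeding stock.

```python
import random


def next_population(history, attempted, pop_size=500, n_breed=500,
                    p_cross=0.70, p_mut=0.10, rng=None, max_tries=1_000_000):
    """Build one new population from the full history of observed results.

    history   : dict mapping chromosome (bit string) -> objective score, lower is better
    attempted : set of every chromosome ever submitted, including failed cases
    """
    rng = rng or random.Random()
    breeding_stock = sorted(history, key=history.get)[:n_breed]
    seen = set(attempted)          # cases that must not be repeated
    new_population = []
    tries = 0
    while len(new_population) < pop_size:
        tries += 1
        if tries > max_tries:      # hypothetical guard against an exhausted design space
            raise RuntimeError("could not assemble enough unique cases")
        # Select two members and keep the one with the better (lower) score.
        a, b = rng.sample(breeding_stock, 2)
        parent = a if history[a] <= history[b] else b
        children = [parent]
        if rng.random() < p_cross:
            # Swap the genes between two random cut points with a second member.
            mate = rng.choice(breeding_stock)
            i, j = sorted(rng.sample(range(len(parent) + 1), 2))
            children = [parent[:i] + mate[i:j] + parent[j:],
                        mate[:i] + parent[i:j] + mate[j:]]
        for child in children:
            if rng.random() < p_mut:
                # Flip one randomly chosen bit.
                k = rng.randrange(len(child))
                child = child[:k] + ("1" if child[k] == "0" else "0") + child[k + 1:]
            # Discard anything already attempted or already in the new population.
            if child not in seen and len(new_population) < pop_size:
                seen.add(child)
                new_population.append(child)
    return new_population
```

Because every member of the breeding stock is already in the attempted list, a child that passes through both operators unchanged is always rejected by the final check.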

This  process  was  repeated  until  the  new  population  contained  five  hundred  new,  unique  cases. 
All  members  of  the  new  population  were  then  added  to  the  list  of  pre-existing  cases  to  ensure 
that  they  would  not  be  duplicated  in  a  future  population.  Note  that  ordinarily,  there  would  be 
some  chance  that  an  old  population  member  could  be  added  to  the  new  population 
unchanged  (27  percent  chance,  for  the  specified  probability  values).  All  such  cases  were 
rejected  by  this  formulation,  and  only  population  members  which  experienced  crossover 
and/or  mutation  would  enter  the  new  population.  As  a  result,  the  proportion  of  “children” 
which  experienced  crossover  or  mutation  was  higher  than  70  percent  and  10  percent, 
respectively.  In  tests,  these  proportions  were  commonly  closer  to  95  percent  and  13  percent. 
Once  again,  these  crossover  and  mutation  rates  were  significantly  higher  than  the 
recommended  values  cited  by  Grefenstette.[69]  However,  Grefenstette  assumes  an  algorithm 
that  has  no  memory  beyond  the  current  population,  and  in  particular  warns  that  high 
mutation  rates  risk  introducing  so  much  noise  that  progress  could  be  lost  between 
generations.  In  the  current  application,  each  new  generation  was  based  on  not  only  the 
generation  immediately  preceding  it  but  the  best  results  ever  observed,  so  the  risk  of  losing  a 
good  “bloodline”  was  not  present. 
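The 27 percent figure, and the direction of the shift in the observed rates, can be checked with a short calculation. This is a rough sketch under simplifying assumptions that are not stated in the text: the crossover and mutation draws are independent, either operator always changes the case, and each child is treated as a single draw (ignoring that each crossover yields two children and that some changed children are rejected as duplicates).

```latex
\begin{align*}
P(\text{unchanged}) &= (1 - 0.70)(1 - 0.10) = 0.27 \\
P(\text{crossover} \mid \text{changed}) &= \frac{0.70}{1 - 0.27} \approx 0.96 \\
P(\text{mutation} \mid \text{changed}) &= \frac{0.10}{1 - 0.27} \approx 0.14
\end{align*}
```

These estimates are close to the proportions of roughly 95 percent and 13 percent observed in tests.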

6.10.7  Results  of  Genetic  Algorithm  Search 

Each of the nine parameters was described using an eight-bit string, allowing 256 possible
values for each parameter. The algorithm was initialized with a five-hundred-point
space-filling design. Once those cases had been analyzed at each flight condition using Cart3D, the
overall objective function defined by Equations 17 and 18 was used to calculate a score for each
case. The cases were then subjected to the GA, which produced a set of five hundred
new cases for analysis. A total of twenty-five batches were selected by genetic algorithm.

Those batches plus the initial space-filling set totaled 13,000 cases, of which 1,470 had
$|C_{m,i}| < 0.1$ at every flight condition, a success rate just over 11 percent. This was deemed a
sufficient quantity of test data.

Now  that  a  block  of  cases  of  interest  was  available,  the  sampling  experiment  could  begin.  The 
hypothesis  being  tested  was  that: 

Contour-based  sampling  will  balance  the  selection  of  cases  with  good  performance  and  the 
reduction  of  prediction  uncertainty  in  promising  regions,  identifying  samples  that  efficiently 
improve  surrogate  accuracy  for  configurations  with  small  aerodynamic  moments. 

To test this hypothesis, a set of space-filling samples and a set of samples selected by
contour-based sampling (CBS) were collected and surrogate models trained using each. The
accuracy of the resulting models was then compared. The comparison would depend on prediction
accuracy for the cases of interest identified by the genetic algorithm as detailed above, with
prediction error quantified using Root Mean Square Error (RMSE).[85]

It should be emphasized that this hypothesis did not claim some absolute level of performance,
but rather a relative improvement in performance. The expected result was that the CBS-based
surrogates would produce smaller RMSE values for a given number of samples and/or
equal RMSE values for a smaller number of samples. If the CBS-based surrogates were found
to achieve this objective, this hypothesis would be considered to be supported.

Before  contour-based  sampling  could  be  applied,  an  initial  set  of  samples  was  required  to  create 
the  initial  surrogate  models  for  candidate  evaluation.  It  was  decided  that  this  initial  set  of 
samples would be used by both the space-filling and the adaptively-sampled approaches to
eliminate  the  risk  that  one  or  the  other  method  might  benefit  from  a  particularly  lucky  or  unlucky 
sample  distribution.  An  initial  sample  size  of  five  hundred  cases  was  selected  as  being  large 
enough  to  roughly  approximate  the  behavior  of  the  response  while  still  leaving  ample  room 
for  improvement.  All  Kriging  models  in  this  experiment,  both  for  space-filling  and  adaptively- 
sampled  cases,  would  use  quadratic  fits  and  Gaussian  correlation  models. 

In order to generate a small space-filling set of cases that could be augmented without losing
the space-filling properties, a nested Latin hypercube[152] was selected to generate the
space-filling sample set.

6.11  Null  Hypothesis:  Space-Filling  Samples 

The  nested  hypercube  for  this  effort  contained  multiple  space-filling  subsets  ranging  from  500 
to  16,000  cases,  progressively  doubling  in  size.  The  maximum  size  was  intended  to  be 
substantially  more  than  the  experiment  would  require  to  minimize  the  risk  that  additional 
space-filling  cases  would  be  needed.  It  was  expected  that  this  experiment  would  use  Kriging 
models,  which  become  unwieldy  when  applied  to  a  pool  of  more  than  a  few  thousand  points,  so 
the  null  hypothesis  would  primarily  depend  on  the  hypercubes  of  size  500,  1,000,  2,000  and 
4,000  cases.  The  remaining  cases  would  be  available  if  larger  sample  sizes  became  necessary. 
This  was  not  considered  likely,  but  the  marginal  cost  for  the  extra  capacity  was  negligible  and 
would  be  a  useful  resource  if  experimentation  revealed  that  more  samples  would  be  necessary. 

Of  these  prepared  cases,  only  4,000  were  deemed  necessary  and  analyzed. 

Because these samples could be selected simultaneously without any knowledge of the response
behavior, these analyses were comparatively simple to prepare. Parameter values for each
case were passed to the PaceLab tool,[61, 142] which defined the vehicle outer mold line (OML)
using a triangular mesh that could be interpreted by the flow simulation software. Because all
cases for this set were selected in advance, the analysis required little oversight by the
experimenter.

Although samples were quite easy to select, not every case ran successfully. Cases were
excluded from the final data sets for a variety of reasons. These reasons included difficulties
creating the surface mesh, a lack of convergence in the flow solver results, or, very rarely,
convergence to nonsensical results such as negative drag. More information about these
difficulties, as well as efforts to mitigate them, may be found in Appendix D. If a case did
not  produce  well-behaved,  numerically-converged  results  at  every  flight  condition,  that  case 
was  excluded  from  the  final  data  set.  As  a  result,  roughly  70  percent  of  the  space-filling  cases 
were  included  in  the  final  data  set. 

Once the analysis results for the common core of space-filling data were available, contour-based
sampling commenced.

6.11.1  Alternative  Hypothesis:  Contour-Based  Sampling 

The 500 space-filling points from the nested Latin hypercube were used as the initial data set for
the contour-based sampling algorithm. Of those 500 cases, 370 converged successfully at all
flight conditions. Before the sampling algorithm could be applied in earnest, a handful of
technical decisions had to be made.

In the adaptive-sampling literature, it is typically assumed that after the most desirable sample
is selected, the response value or values for that sample are available immediately. To facilitate
this, researchers demonstrating adaptive sampling algorithms often either use analytical
functions as responses or generate a large pool of data in advance.[113, 149, 159] In the present
application, the aerodynamic analysis was being performed on remote computing systems, and
the authentication procedures required to access those systems did not allow automated login
or file transfer without an authorized human in the loop.[200] This limitation made it impractical
to analyze each sample as it was selected. As for pre-generated data, although that approach was
used for the two-dimensional demonstrations of contour-based sampling, the present
nine-dimensional space was deemed too large for full a priori sampling.

An alternative was sought to analyzing each case as it was identified. Mackman and Allen
describe a modeling approach in which multiple samples are taken without updating the
surrogate model.[112] This approach was adopted and modified slightly to reflect differences in
the sampling objectives: Mackman and Allen seek samples in regions of response
nonlinearity and regions of low sample density. Their decision to not update the surrogate
model affects the calculation of nonlinearity but not sample density, and so even without
updating their surrogate there is no risk that any sample would be placed too close to the others.

Contour-based sampling, on the other hand, used Kriging models to estimate the response and
prediction variance at a given point; if a sample was not added to the model, the estimate for
prediction variance would not reflect the effects of that sample, and there was a risk that
multiple samples would be selected in the same region - samples which might have been
more effectively placed elsewhere. To avoid this, each sample had to be added to the Kriging
model once selected so that later calculations would properly estimate prediction variance.
There was a mild complication due to the fact that the actual response value at that point could
not be known until after it was analyzed; this was addressed by estimating the response at that
point using the current Kriging model, adding the point and its estimated response to the Kriging
model, and updating the Kriging model once the actual response value was known. A set of
cases selected between CFD analyses was referred to as a “batch.”
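The placeholder-response idea can be illustrated with a small one-dimensional sketch. Several simplifications here are assumptions made for illustration only: a simple Kriging model with zero mean, unit process variance, and a fixed Gaussian correlation parameter is used in place of the quadratic-trend models of this study, and the selection rule is plain maximum prediction variance rather than the contour-based criterion. All function names are hypothetical.

```python
import math


def corr(a, b, theta=10.0):
    """Gaussian correlation between two 1-D points (theta is an assumed value)."""
    return math.exp(-theta * (a - b) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def predict(xs, ys, x_new):
    """Return the simple-Kriging mean and normalized variance at x_new."""
    R = [[corr(a, b) for b in xs] for a in xs]
    r = [corr(a, x_new) for a in xs]
    w = solve(R, r)                                   # R^{-1} r
    mean = sum(wi * yi for wi, yi in zip(w, ys))
    var = 1.0 - sum(wi * ri for wi, ri in zip(w, r))
    return mean, max(var, 0.0)


def select_batch(xs, ys, candidates, batch_size):
    """Pick a batch, adding each pick to the model with its predicted response."""
    xs, ys, batch = list(xs), list(ys), []
    for _ in range(batch_size):
        remaining = [c for c in candidates if c not in xs]
        pick = max(remaining, key=lambda c: predict(xs, ys, c)[1])
        placeholder, _ = predict(xs, ys, pick)        # stand-in until CFD results arrive
        xs.append(pick)
        ys.append(placeholder)
        batch.append(pick)
    return batch
```

Because each pick is added to the model before the next selection, the prediction variance near it collapses and later picks are pushed elsewhere; without that step, every selection in the batch would fall at the same high-variance location.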

This decision raised a new question: how many samples should be selected before they are
uploaded for analysis? Whenever new analysis results were added to the data pool, surrogate
models could be re-trained to reflect the most up-to-date information. Analyzing each sample
immediately would mean that all subsequent sample selection would be based on the best
information possible. Conversely, delaying sample analysis and using the current surrogate
models as stand-ins meant that the surrogates would not be updated as often, and thus would not
be as accurate as they might have been. As a result, later sample selections would be based on
information that was not as accurate as it might have been.

A larger number of samples per batch would result in decreased human workload but might
reduce the overall effectiveness of the sampling algorithm. For this experiment, it was decided
that batches of 50 cases ought to be an effective compromise between human workload and
losses due to imperfect information. In light of the 70 percent success rate observed with the
space-filling cases, selecting 70 cases per batch was expected to result in roughly 50 useful
results. Each batch would therefore consist of 70 cases. Attention then turned to selecting
values for the free parameters of the algorithm.

The  sampling  algorithm  has  various  free  parameters  that  can  be  adjusted  to  match  the  user’s 
preference.  These  free  parameters  include  the  required  Probability  of  Interest  (POI),  the 
number  of  candidate  points  being  evaluated  in  each  round,  and  the  number  of  test  points  used 
in  the  evaluations.  These  parameters  interact  to  determine  the  behavior  of  the  algorithm. 

As mentioned in Section 4.3, a larger number of candidate points would give the algorithm a larger
set of options from which to choose the next point, but would result in increased
computational effort per selection. A larger number of test points would provide a more
complete  understanding  of  how  each  candidate  would  affect  prediction  variance  throughout  the 
space,  but  would  also  result  in  increased  computational  effort.  A  higher  POI  requirement  would 
exclude  more  candidates  from  consideration,  reducing  the  effort  required  to  select  a  new  point, 
but  might  handicap  the  algorithm’s  ability  to  explore  the  design  space  (as  shown  in  Section  4.9). 
If  the  computing  resources  are  fixed,  it  might  be  more  useful  to  translate  the  required 
computational  effort  per  sample  selection  into  terms  of  time  elapsed  per  sample  selection.  The 
processing  times  that  are  cited  in  this  section  were  from  tests  using  two  parallel  quad-core  Intel 
2.83  GHz  processors  and  4  gigabytes  of  RAM. 

Each case was selected from an 800-point Latin hypercube of candidate points; the candidates
were evaluated using a separate 1,200-point Latin hypercube. New hypercubes were generated
each time a new point was selected, as suggested in the original paper by Picheny et al.,[149] to
reduce the risk that any important portion of the design space is neglected during the sampling
process.
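Regenerating the candidate and test sets for each selection could look like the following sketch. It uses a basic randomized Latin hypercube on the unit cube; the function names are hypothetical, and the original work may have used a different generator or an optimized design.

```python
import random


def latin_hypercube(n_points, n_dims, rng):
    """Return n_points samples in [0, 1)^n_dims, one per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)                      # random pairing of strata across dimensions
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def fresh_point_sets(n_candidates, n_test, n_dims, rng):
    """Draw independent candidate and test hypercubes for one sample selection."""
    return (latin_hypercube(n_candidates, n_dims, rng),
            latin_hypercube(n_test, n_dims, rng))
```

Calling `fresh_point_sets(800, 1200, 9, rng)` before each selection would give a new 800-point candidate set and a new 1,200-point test set in the nine-dimensional unit cube, which would then be scaled to the parameter ranges.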

Within  each  batch,  the  required  POI  was  varied.  In  particular,  a  more-restrictive  required  POI 
was  used  for  early  samples,  encouraging  the  algorithm  to  prioritize  the  most  promising  regions, 
and  this  value  was  progressively  reduced  to  allow  a  wider  variety  of  candidates  to  be 
considered.  This  approach  was  intended  to  maximize  the  effectiveness  of  the  batch  as  a  whole. 
In Section 4.9 it was demonstrated that when the POI requirement was high, the algorithm tended
to select cases closer to regions that it had already sampled.

In  the  initial  batches,  the  first  5  rounds  were  given  a  required  POI  of  10  percent.  The  required 
POI  was  then  progressively  reduced  to  a  minimum  of  1  percent.  The  per-case  selection  times 
ranged  from  around  3  minutes  for  the  highest  POI  to  around  2  hours  for  the  lowest  POI.  When 
the  effects  of  these  settings  were  investigated,  it  was  found  that  even  for  repeated  samplings,  no 
candidates  were  found  to  have  a  POI  value  exceeding  9  percent. 

A set of candidates was investigated in order to determine the reason for these low probability
of interest scores. Recall that to calculate the POI score for a candidate, all predicted responses
for  that  candidate  were  considered.  Based  on  the  predicted  response  values  and  the  prediction 
variance  in  each  response,  it  was  possible  to  calculate  the  likelihood  that  the  actual  response 
values  for  that  candidate  fell  within  the  specified  ranges  of  interest.  This  likelihood  was 
calculated  for  each  response,  and  the  lowest  likelihood  value  was  the  probability  of  interest  for 
that  candidate. 

It  was  found  that  the  Mach  0.3  pitching  moment  was  the  response  least  likely  to  fall  within  the 
region  of  interest  for  almost  every  candidate  that  was  investigated.  The  candidate  with  the 
highest  POI  score  (8.56  percent)  was  predicted  to  have  a  Cm  at  Mach  0.3,  AoA  0°  of  -0.06, 
which  is  definitely  within  the  range  of  interest.  The  estimated  variance  for  that  prediction, 
however,  was  0.86  -  very  large  compared  to  the  range  of  interest,  which  had  a  width  of  0.2. 
Thus,  even  though  candidates  were  predicted  to  have  attractive  response  values,  the  prediction 
confidence  was  so  low  that  every  candidate  received  low  POI  scores. 
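The POI calculation described above can be sketched as follows, assuming that each Kriging prediction is treated as a normal distribution with the predicted mean and prediction variance, and that the range of interest is -0.1 to 0.1 for every response. The function names are illustrative.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def prob_in_range(mean, variance, lo, hi):
    """Likelihood that a normally distributed response falls within [lo, hi]."""
    std = math.sqrt(variance)
    return normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std)


def probability_of_interest(predictions, lo=-0.1, hi=0.1):
    """POI for a candidate: the lowest per-response likelihood.

    predictions: iterable of (predicted mean, prediction variance), one per response.
    """
    return min(prob_in_range(m, v, lo, hi) for m, v in predictions)
```

Under these assumptions, `prob_in_range(-0.06, 0.86, -0.1, 0.1)` evaluates to roughly 0.086, which is close to the 8.56 percent reported for the best candidate and illustrates how a large prediction variance caps the POI even when the predicted value sits well inside the range.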

There were two possible interpretations for this behavior. The first interpretation was that the
Kriging surrogate for Mach 0.3 had low confidence in the region of interest because it had not
yet sampled those regions adequately, and candidates with higher POI scores could not be
generated until more analysis was completed. The second interpretation was that the
candidate points that were being analyzed were sampling the space too sparsely, and
more-attractive candidates could be identified by a denser sampling.

To  determine  which  interpretation  was  correct,  the  number  of  candidate  points  was  increased 
from  800  to  3,000  for  the  fifth  batch  of  samples.  The  number  of  test  points  remained  1,300. 
Once  again,  no  candidates  were  identified  to  have  POI  scores  above  10  percent.  To  avoid  a 
four-fold  increase  in  sample  selection  time,  the  POI  requirements  were  increased  until  roughly 
the  same  number  of  candidates  would  meet  the  requirements.  The  number  of  candidates  was 
briefly  increased  to  9,000,  and  again  no  candidates  were  found  to  have  high  POI  values. 

Based  on  this  evidence,  it  appeared  that  high  prediction  variance  for  the  Mach  0.3  flight 
condition  was  driving  the  low  POI  scores,  even  for  relatively  dense  distributions  of  samples.  In 
universal  Kriging  models,  the  prediction  variance  is  in  part  driven  by  the  ability  of  the 
underlying  linear  or  quadratic  model  used  in  the  predictor  to  replicate  the  data  points.  When 
the  underlying  model  captures  response  behavior  well,  the  prediction  variance  will  be  low; 
when  the  observed  data  is  not  well-matched  by  the  underlying  model,  the  prediction  variance 
will  be  high,  indicating  that  the  response  is  more  likely  to  be  far  from  the  prediction  of  the 
underlying  model. 

Going  back  to  Equation  4  on  page  81,  it  should  be  noted  that  the  process  variance  is  an 
important  factor  in  the  estimation  of  prediction  variance.  This  is  particularly  clear  when  it  is 
known that c(x) and C, the covariance vector and matrix, may also be written as the product of
the  process  variance  and  the  correlation  vector  and  matrix,  respectively.  Thus,  the  prediction 
variance  estimate  is  directly  proportional  to  the  process  variance.  This  process  variance  is 
estimated  by  the  DACE  toolbox  when  fitting  a  model  to  each  response.  When  the  estimated 
process  variance  was  investigated  for  the  space-filling  and  CBS-based  models,  it  was  found  that 
the average estimated process variance value for Mach 0.3 Cm models was 1.36, while for Mach
0.8 and 2.5 Cm models the values were 0.13 and 0.30, respectively. This indicated that the
Kriging  models  for  Mach  0.3  Cm  were  significantly  more  dependent  on  correlation  effects  to 
match  the  observed  responses,  rather  than  the  underlying  quadratic  model,  and  thus  only 
candidates  very  close  to  existing  samples  would  have  small  prediction  variances. 

After  these  observations  were  made,  it  was  accepted  that  increasing  the  number  of  candidates 
evaluated  was  not  likely  to  result  in  more  attractive  candidates,  and  the  number  of  candidate 
and  test  points  were  left  at  3,000  and  1,300,  respectively.  Ultimately,  10  batches  of  adaptive 
sampling  cases  were  selected  and  analyzed.  The  results  were  then  fit  with  Kriging  models  and 
the  predictive  accuracy  of  those  models  quantified. 

The various models would be evaluated using the cases of interest identified via genetic
algorithm, as described in Section 4.10.6. This data set consisted of 1,471 cases, each of
which exhibited pitching moments within ±0.1 at all 3 flight conditions. The hypothesis being
tested was that:

Contour-based  sampling  will  balance  the  selection  of  cases  with  good  performance  and  the 
reduction  of  prediction  uncertainty  in  promising  regions,  identifying  samples  that  efficiently 
improve  surrogate  accuracy  for  configurations  with  small  aerodynamic  moments. 

In  order  to  consider  this  hypothesis  supported,  then,  the  Kriging  models  based  on  samples 
selected  by  contour-based  sampling  should  be  more  accurate  -  that  is,  should  have  less 
prediction  error  -  than  models  based  on  space-filling  cases. 

To test this, Kriging models were created using the different sets of points. After ten rounds of
contour-based sampling, the adaptive data set contained a total of 913 cases: the initial
500-point space-filling design, of which 363 or 73 percent met all convergence requirements, plus
550 adaptive points, or 79 percent of the selected cases. As for the nested Latin hypercube
which served as the space-filling sampling, a number of training data sets were available. The
first set which was larger than the largest adaptive set was the 2,000-point hypercube, of which
1,387 cases or 69 percent met all convergence criteria.

As  a  result,  the  space-filling  data  sets  used  to  test  this  hypothesis  were  the  1,000-  and  2,000- 
point  hypercubes.  The  500-point  hypercube  was  shared  by  both  methods  as  common  ground. 
The  adaptive  data  sets  included  the  shared  hypercube  points,  and  each  adaptive  data  set  was 
added  in  turn  to  illustrate  how  each  batch  of  samples  affected  prediction  accuracy.  Kriging 
models  were  built  using  each  data  set  in  turn,  and  then  the  predictive  accuracy  of  every  model 
was  tested. 

6.11.2  Evaluating  Predictive  Accuracy 

To  evaluate  the  accuracy  of  each  surrogate  model,  it  was  used  to  predict  the  three  response 
values  (i.e.,  pitching  moments)  for  every  case  of  interest  in  the  test  set.  The  actual  values 
were  then  subtracted  from  the  predicted  values  to  obtain  prediction  errors.  The  prediction 
errors  for  each  response  were  then  used  to  calculate  the  Root  Mean  Squared  Error,  or  RMSE, 
for  that  response  using  Equation  15. [85] 
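A minimal sketch of the RMSE calculation is shown below. It uses the standard definition (the square root of the mean squared prediction error), which is assumed here to match Equation 15; the function name is illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual response values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

In this experiment the function would be applied once per response, for example to the predicted and actual Mach 0.3 pitching moments of every case of interest in the test set.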
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Figure  25  shows  the  RMSE  seores  for  each  combination  of  model  and  response.  The  models 
based  on  contour-based  sampling  (CBS)  are  represented  with  circular  icons,  while  models  based 
on  the  space-filling  nested  Latin  hypercube  (NLHC)  samples  are  represented  with  square  icons. 
Figure  25a  shows  the  RMSE  values  for  the  first  flight  condition,  Mach  0.3  cr  0°.  It  is  clear  that 
models  based  on  CBS  cases  have  improved  predictive  accuracy  compared  to  the  models  based 
on  NLHC  cases.  For  an  equal  number  of  samples  contour-based  sampling  could  produce  an 
RMSE  10  percent  less  than  the  1,000-point  NLHC  training  set.  Alternatively,  to  achieve 
equivalent  prediction  accuracy  for  this  response,  contour-based  sampling  required  only  one 
batch  of  samples,  producing  an  RMSE  of  0.630  using  416  cases  versus  the  space-filling  RMSE 
of  0.643  using  710  cases.  This  would  be  a  savings  of  41  percent  of  the  space-fdling  sample 
evaluations. 
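
As a rough check on the figures above, the sketch below computes an RMSE in its standard form and reproduces the quoted sample savings. Equation 15 is not reproduced in this section, so the standard definition is assumed; the function names and the short example vectors are hypothetical, while the case counts 416 and 710 are taken from the text.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error of paired predictions (standard definition, assumed)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Prediction error = predicted value minus actual value, as described in the text.
    errors = [p - a for p, a in zip(predicted, actual)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def sample_savings(adaptive_cases, space_filling_cases):
    """Fraction of space-filling evaluations avoided at equivalent accuracy."""
    return 1.0 - adaptive_cases / space_filling_cases

# Hypothetical pitching-moment values, for illustration only.
example_rmse = rmse([0.10, -0.25, 0.40], [0.00, -0.05, 0.20])

# Case counts quoted in the text for the first flight condition.
savings = sample_savings(416, 710)
print(f"RMSE = {example_rmse:.4f}, savings = {savings:.1%}")
```

The savings value rounds to the 41 percent quoted above.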

Note,  however,  that  the  last  few  CBS  models  exhibited  a  shallower  slope  than  that  of  the  NLHC 
models.  If  this  trend  were  to  hold  constant,  the  2,000-point  NLHC  model  would  out-perform 
an equally-sized CBS model. Twice in the early rounds - during the first and fourth rounds,
specifically - the prediction accuracy increased relatively sharply. No such jumps occurred after the
fourth  round  of  sampling. 

Interestingly,  the  fifth  round  of  adaptive  sampling  was  when  both  the  number  of  candidates  and 
the  required  POI  were  increased.  This  is  suggestive,  especially  in  light  of  the  observation 
from  Section  4.9  that  the  POI  requirement  sometimes  causes  the  algorithm  to  stay  close  to 
known  points  rather  than  exploring  the  space.  It  is  possible  that  in  this  case,  the  increased  POI 
requirements  which  were  intended  to  keep  the  sample  selection  time  within  acceptable  limits 
had  a  secondary  effect:  those  requirements  may  also  have  constrained  the  algorithm  to  only 
evaluate  relatively  conservative  candidates,  those  which  fell  quite  close  to  existing  samples. 
These  candidates  had  a  high  likelihood  of  falling  within  the  region  of  interest,  indicated  by 
their  high  POI  scores,  but  because  they  were  close  to  known  cases  the  algorithm’s 
understanding  of  response  behavior  did  not  progress  very  rapidly.  This  is  an  intriguing  idea, 
and  further  study  may  be  worthwhile. 

In the meantime, Figure 25b shows the second flight condition, Mach 0.8, $\alpha$ = 0°. In this case,
the 1,000-case NLHC model was less accurate than the 500-case model. It would seem that,
by attempting to model this set of training cases as accurately as possible, the resulting Kriging
model actually did a worse job of predicting the region of interest. The 2,000-case NLHC
model offered predictive accuracy that was better than both smaller NLHC models.

The  CBS  models  showed  an  even  faster  rate  of  improvement  than  for  the  previous  response. 
After  ten  rounds,  the  prediction  RMSE  score  had  improved  by  one-third  compared  to  the 
original  500-point  NLHC  model.  This  is  not  to  say  the  progress  was  smooth:  the  third  CBS 
model actually had slightly worse prediction accuracy than the preceding model. This hiccup
was quickly rectified, though. Note that again, after the fourth batch of samples, the improvement
in  the  CBS  models  was  consistent  but  somewhat  slow  relative  to  the  progress  that  was  made  by 
the earlier batches.

Figure 25: Prediction Accuracy for Cases of Interest: RMSEs at (a) Mach 0.3, $\alpha$ = 15°; (b) Mach 0.8, $\alpha$ = 0°; & (c) Mach 2.5, $\alpha$ = 0°

After  the  fourth  batch  of  adaptive  cases,  the  CBS  models  were  more  accurate  than  the  2,000- 
point  NLHC  model  despite  using  only  556  samples  compared  to  the  1,387  of  the  NLHC  set. 
Put  another  way,  equivalent  accuracy  -  CBS  RMSE  of  0.463  compared  to  NLHC  RMSE  of 
0.471 - was achieved using less than half the number of samples.

Lastly, Figure 25c illustrates the third flight condition, Mach 2.5, $\alpha$ = 0°. The NLHC models
exhibited  smooth,  steady  improvement  as  the  number  of  samples  was  increased.  The  CBS 
models,  on  the  other  hand,  exhibited  behavior  that  was  somewhat  less  than  intuitive.  The  very 
first  batch  of  samples  resulted  in  a  markedly  worse  predictive  accuracy,  and  it  took  3  batches  of 
samples  until  the  predictive  accuracy  was  as  good  as  before  contour-based  sampling  began.  It 
is  believed  that  the  first  batch  of  CBS  cases,  while  demonstrably  beneficial  for  the  accuracy  of 
the  other  models,  resulted  in  a  less-accurate  model  for  the  third  response.  It  was  gratifying  to 
see  that  the  algorithm  was  error-tolerant,  rectifying  the  early  inaccurate  models  and  still  offering 
better  accuracy. 

The  space-filling  models  showed  approximately  linear  improvement  as  more  cases  were 
included.  The  adaptive  models  showed  what  was  roughly  a  linear  trend  with  a  steeper  slope,  but 
some  suggestion  of  becoming  less  steep  in  the  later  rounds.  There  was  a  slightly-larger-than- 
average  improvement  between  the  fifth  and  sixth  batches.  This  indicated  that  increased  POI 
requirements  for  these  batches  did  not  negate  the  chance  of  a  breakthrough,  although  those 
chances might have been restricted somewhat.

In general, it was seen that contour-based sampling did indeed produce surrogate models that
had  better  predictive  accuracy  than  those  based  on  a  priori  space-filling  sampling.  For  a  given 
number  of  space-filling  samples,  contour-based  sampling  was  shown  to  offer  the  user  the  choice 
between  achieving  the  same  performance  (quantified  here  by  prediction  accuracy)  using  fewer 
samples,  or  achieving  better  performance  using  the  same  number  of  samples.  In  essence,  the 
proposed  sampling  method  created  the  opportunity  for  interplay  between  the  observed 
responses  and  the  samples  that  are  selected,  often  producing  a  more  effective  sample 
distribution  than  could  have  been  selected  a  priori. 

The  hypothesis  that  was  being  tested  was  that: 

Contour-based  sampling  will  balance  the  selection  of  cases  with  good  performance  and  the 
reduction  of  prediction  uncertainty  in  promising  regions,  identifying  samples  that  efficiently 
improve  surrogate  accuracy  for  configurations  with  small  aerodynamic  moments. 

The  preceding  evidence  has  shown  that  contour-based  sampling  did  in  fact  improve  surrogate 
accuracy for configurations with small aerodynamic moments. In light of this evidence,
Hypothesis 1 was considered to be supported.

7 Evaluating Multi-fidelity Modeling & Uncertainty

The previous chapter evaluated the first supporting hypothesis, which addressed the process of
selecting  the  most  useful  data  points.  The  best  way  to  leverage  those  data  points  has  not  yet 
been  addressed.  That  is  the  purpose  of  this  section. 


A  previous  section  described  a  number  of  alternative  multi-fidelity  techniques  which  would 
allow  the  user  to  estimate  a  response  value  using  multiple  sources  of  data.  These  techniques 
included  data  harmonization,  additive  correction,  proportional  correction,  and  Ghoreyshi 
cokriging.  Eventually  the  predictive  accuracy  of  these  techniques  will  be  compared;  first, 
though,  each  method  needed  to  be  implemented. 

7.1 Implementing the Methods Under Consideration

Most  of  the  methods  are  fairly  straightforward  variations  on  a  single-fidelity  Kriging  model  once 
the  low-fidelity  response  values  are  available.  The  most  challenging  aspect  of  their 
implementation  was  the  modification  of  the  DACE  Kriging  toolbox  to  incorporate  user-defined 
uncertainty  estimates  in  the  form  of  nuggets. 

7.1.1  Kriging  With  Nuggets 

This description of Kriging with nuggets is based on the work of Gramacy & Lee.[66] Adding
a nugget to the Kriging mathematics (given in Section 4.2) is fairly straightforward. The
covariance  matrix  C,  which  describes  how  the  existing  training  data  points  relate  to  each  other, 
is  augmented  by  an  extra  term  along  its  diagonal.  Put  more  precisely,  the  entries  of  C  are 
calculated  as: 

$$ C^{*}(x_j, x_k \mid g) = C(x_j, x_k) + \delta_{j,k} \frac{g}{\sigma^{2}} \tag{19} $$

Here, $C(x_j, x_k)$ is the covariance between any two points $x_j$ and $x_k$, calculated using a
user-selected covariance function such as the Gaussian function (given in Equation 3). $\sigma^{2}$ is the
process variance of the response, which - if not known in advance - is commonly estimated
while fitting the Kriging model to the data. $g$ is the nugget, an expression of the inherent
uncertainty in the measured response values. This uncertainty may stem from measurement
errors or “microvariations” operating at scales too small to be captured by the available data,
either of which would manifest as “white noise” in the response behavior. [67, 93] Note that $g$
is always > 0 and has the same units as the process variance. $\delta_{j,k}$ is the Kronecker delta
function,[165] which is given as:

$$ \delta_{j,k} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases} \tag{20} $$


The  DACE  Kriging  toolbox  already  adds  a  small  nugget,  with  a  value  on  the  order  of  machine 
precision, to the covariance matrix; see Section 5.1 in Lophaven et al.[108] for more details.
This nugget is intended to regularize the system of equations and increase numerical stability
for ill-conditioned problems. The DACE function dacefit.m contains the code which fits a
Kriging model to data points. For reference, the nugget is added to the covariance matrix on
lines 127-129 of this function.

This application of a nugget adds a constant value to every diagonal term of the covariance
matrix.  This  is  a  common  approach,  especially  when  the  nugget  is  intended  to  improve 
numerical  stability  (as  in  the  DACE  toolbox)  or  when  measurement  error  is  considered  to  be 
constant  for  every  measurement.  [93,  95,  182]  However,  it  is  also  possible  to  apply  a  different 
nugget  value  to  each  diagonal  term,  as  described  by  Yin  et  al.[197]  Such  an  approach  would 
capture  any  variances  which  were  not  constant  throughout  the  design  space,  such  as  that 
stemming from the iterative solver of Cart3D.

Once  the  portion  of  DACE  code  which  applied  the  stabilization  nugget  was  identified, 
modifications  were  made  so  that  other  nugget  values  could  be  specified.  The  dacefit  function 
was  extended  to  allow  the  user  to  pass  an  extra  vector  of  inputs  which  contained  nugget 
values  for  each  data  point.  This  extension  was  relatively  simple  since,  as  previously  noted, 
dacefit  already  generated  the  machine  precision  nugget  as  a  vector  and  added  it  to  the 
covariance  matrix.  The  only  change  that  needed  to  be  made  was  to  add  the  user-specified 
vector  to  the  existing  vector  of  nuggets,  and  to  modify  the  function  input-output  specifications 
so  that  the  extra  input  parameter  would  not  cause  an  error  message. 
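
The idea behind Equation 19 and the extension described above can be sketched as follows. This is Python rather than the MATLAB toolbox code. The function names, the Gaussian correlation form exp(-sum(theta_i * d_i^2)), and the use of machine epsilon as the stabilizing term are assumptions made for illustration; the sketch does not reproduce the DACE source.

```python
import math
import sys

def gaussian_correlation(xj, xk, theta):
    """Gaussian correlation exp(-sum(theta_i * d_i**2)); form assumed for illustration."""
    return math.exp(-sum(t * (a - b) ** 2 for t, a, b in zip(theta, xj, xk)))

def correlation_with_nuggets(points, theta, user_nuggets):
    """Build the matrix of Equation 19 with one nugget per data point.

    `user_nuggets` holds g / sigma^2 for each point, i.e. nuggets that have
    already been scaled by the process variance.
    """
    n = len(points)
    if len(user_nuggets) != n:
        raise ValueError("one nugget value is required per data point")
    stabilizer = [sys.float_info.epsilon] * n      # stand-in for the built-in nugget
    diagonal = [s + g for s, g in zip(stabilizer, user_nuggets)]
    matrix = [[gaussian_correlation(points[j], points[k], theta) for k in range(n)]
              for j in range(n)]
    for j in range(n):
        matrix[j][j] += diagonal[j]                # Kronecker delta: diagonal entries only
    return matrix

# Hypothetical two-dimensional sample sites and per-point nuggets.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
C = correlation_with_nuggets(points, theta=[1.0, 0.5], user_nuggets=[0.0, 0.1, 0.2])
```

Only the diagonal changes, so the off-diagonal entries stay symmetric and independent of the nugget values.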

Note  that  in  Equation  19  the  nugget  is  scaled  by  the  process  variance,  which  is  unlikely  to  be 
known  in  advance.  To  obtain  this  value,  DACE  is  first  used  to  fit  the  model  as  if  the  data 
were  noiseless.  The  resulting  model  includes  an  estimate  for  the  process  variance.  The  desired 
nugget  may  then  be  divided  by  the  process  variance  so  that  it  is  in  the  format  that  the 
modified  dacefit  function  expects.  Casual  tests  indicated  that  the  addition  of  a  custom  nugget 
in  this  manner  does  not  significantly  alter  the  estimated  process  variance  value. 
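
The two-pass procedure described above can be sketched as follows. The `fit` argument is a hypothetical stand-in for the modified fitting routine, and `toy_fit` is a placeholder that only reports a variance so the workflow can run; neither reproduces the DACE algorithm.

```python
from statistics import pvariance
from types import SimpleNamespace

def toy_fit(X, y, nuggets):
    """Placeholder for a Kriging fit; NOT the DACE algorithm.

    It reports the population variance of y as the process variance so that
    the two-pass workflow below can run end to end.
    """
    return SimpleNamespace(process_var=pvariance(y), nuggets=list(nuggets))

def fit_with_scaled_nuggets(fit, X, y, noise_variances):
    """Fit once without user nuggets, then refit with nuggets scaled by sigma^2."""
    noiseless = fit(X, y, [0.0] * len(y))             # pass 1: data treated as noiseless
    sigma2 = noiseless.process_var                    # estimated process variance
    scaled = [v / sigma2 for v in noise_variances]    # g / sigma^2, as in Equation 19
    return fit(X, y, scaled)                          # pass 2: nuggets in the expected format

# Hypothetical data: two points with different noise variances.
model = fit_with_scaled_nuggets(toy_fit, [[0.0], [1.0]], [0.0, 4.0], [0.4, 0.8])
```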

This  modification  was  necessary  to  implement  data  harmonization;  conveniently  it  was  the  only 
change  necessary  to  allow  the  other  multi-fidelity  methods  to  incorporate  uncertainty  via 
nuggets  as  well.  Due  to  their  relatively  simple  formulations,  these  other  methods  will  be 
described  in  detail  before  data  harmonization  is  treated. 

7.1.2  Additive  Correction 

An  additive  correction  model  is  conceptually  the  simplest.  One  may  be  constructed  by  taking 
the difference between the high-fidelity ($Y_{high}$) and low-fidelity ($Y_{low}$) results for a set of
training data ($X$):

$$ Y_{diff} = Y_{high} - Y_{low} \tag{21} $$

A  Kriging  model  would  then  be  fit  to  this  difference,  Ydiff.  To  estimate  the  response  of  the 
high-fidelity  code  at  some  new  point  x,  predictions  are  made  for  the  low-fidelity  response  value 
and  the  value  of  the  difference.  These  predictions  are  then  added  together  to  estimate  the  high- 
fidelity  response: 


$$ \hat{Y}_{high} = \hat{Y}_{low} + \hat{Y}_{diff} \tag{22} $$

The symbol ^ denotes a value that is estimated rather than obtained from a data source.
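
A minimal sketch of additive correction in one input dimension follows. A least-squares straight line stands in for the Kriging model of the difference, and the low-fidelity source is called directly. The function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line in one input; a simple stand-in for Kriging."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def additive_correction(xs, y_high, low_model):
    """Return a predictor of the high-fidelity response."""
    # Difference between high- and low-fidelity results at the training points.
    y_diff = [yh - low_model(x) for x, yh in zip(xs, y_high)]
    diff_model = fit_line(xs, y_diff)
    # Estimated high-fidelity response = low-fidelity value + predicted difference.
    return lambda x: low_model(x) + diff_model(x)

def low_fidelity(x):
    """Hypothetical fast low-fidelity analysis."""
    return 2.0 * x

xs = [0.0, 1.0, 2.0]
y_high = [1.0, 3.5, 6.0]          # hypothetical high-fidelity results
predict_high = additive_correction(xs, y_high, low_fidelity)
```

With these values the difference is exactly linear, so the stand-in surrogate recovers it without error; a Kriging model would be needed for less regular discrepancies.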

If  the  low-fidelity  source  of  data  runs  quickly  enough,  it  can  be  used  directly  to  calculate  the 
low-fidelity  response  whenever  required.  Typically  this  would  include  every  point  in  the 
training data set, as well as every point for which a high-fidelity response must be predicted.
In the case of contour-based sampling, the category of points for which the high-fidelity response
must  be  predicted  would  include  all  candidate  and  test  points  in  every  round.  For  large 
problems  incorporating  thousands  of  candidates  and  test  points,  unless  the  low-fidelity  analysis 
runs  exceedingly  quickly  -  much  faster  than  a  second  per  case  -  it  may  be  worthwhile  to 
create  a  surrogate  model  of  the  low-fidelity  analysis  rather  than  waiting  for  the  low-fidelity 
model  to  process  all  the  necessary  cases. 

By  replacing  the  low-fidelity  data  source  with  a  surrogate,  another  source  of  uncertainty  is 
introduced:  the  discrepancy  between  the  response  estimated  by  the  surrogate  and  the  response 
calculated  by  the  low-fidelity  data  source.  The  process  of  estimating  this  uncertainty  will  depend 
on  the  type  of  surrogate  model  that  is  used.  If  a  response  surface  or  artificial  neural  network  is 
used,  the  prediction  variance  may  be  estimated  from  the  Model  Representation  Error  (MRE) 
standard  deviation,  calculated  as  part  of  the  goodness-of-fit  checks. [36]  Alternatively,  a 
Kriging  model  of  the  low-fidelity  response  can  analytically  calculate  the  prediction  variance  at 
any  given  point.  Either  way,  this  uncertainty  should  then  be  included  as  a  nugget  when  fitting 
a  Kriging  model  to  Ydiff,  as  that  value  depends  on  the  low-fidelity  prediction. 

Uncertainty  data  pertaining  to  individual  high-fidelity  data  points,  such  as  solution  convergence, 
would also be included in the nugget for the $Y_{diff}$ Kriging model. Uncertainty due to model
limitations,  which  would  be  considered  aerodynamic  variances  in  Section  5.8.3,  would  not  be 
included  in  the  nugget.  This  is  because  additive  correction  is  intended  to  produce  a  surrogate 
model  that  matches  the  response  from  a  particular  source  -  such  as  the  pitching  moment 
calculated  by  Cart3D  -  rather  than  trying  to  estimate  the  “true”  response  -  such  as  the  pitching 
moment  measured  by  a  full-scale  flight  test  vehicle.  This  difference  is  crucial  and  will  be 
revisited  in  later  sections. 

7.1.3  Proportional  Correction 

The  proportional  correction  approach  is  also  conceptually  straightforward.  These  models  are 
constructed  by  taking  the  ratio  between  the  high-fidelity  and  low-fidelity  results: 

$$ Y_{ratio} = \frac{Y_{high}}{Y_{low}} \tag{23} $$

A Kriging model is fit to this ratio, $Y_{ratio}$. To estimate the response of the high-fidelity code at
some new point $x$, predictions are made for the low-fidelity response value and the ratio at
that  point.  These  predictions  are  then  multiplied  to  produce  a  prediction  for  the  high-fidelity 
response: 


$$ \hat{Y}_{high} = \hat{Y}_{low} \times \hat{Y}_{ratio} \tag{24} $$


This  approach  is  very  similar  to  the  additive  corrector,  but  is  better  suited  to  problems  where  the 
discrepancy between the low- and high-fidelity response values is proportional to the
magnitude  of  the  response,  rather  than  being  a  constant  bias.  It  may  not  be  possible  to  know  in 
advance  which  formulation  will  result  in  better  predictions,  so  some  experimentation  may  be 
necessary. 
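
Proportional correction can be sketched in the same way as Equations 23 and 24. Here a constant-mean model stands in for the Kriging model of the ratio; all names and data are hypothetical. The explicit zero check reflects the fact that the ratio is undefined wherever the low-fidelity response is zero.

```python
def fit_mean(xs, ys):
    """Constant model returning the mean of ys; a stand-in for a Kriging fit."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def proportional_correction(xs, y_high, low_model, fit=fit_mean):
    """Return a predictor of the high-fidelity response (Equations 23 and 24)."""
    y_low = [low_model(x) for x in xs]
    if any(v == 0.0 for v in y_low):
        raise ZeroDivisionError("ratio is undefined where the low-fidelity response is zero")
    y_ratio = [yh / yl for yh, yl in zip(y_high, y_low)]     # Equation 23
    ratio_model = fit(xs, y_ratio)
    return lambda x: low_model(x) * ratio_model(x)           # Equation 24

def low_fidelity(x):
    """Hypothetical fast low-fidelity analysis."""
    return x + 1.0

xs = [0.0, 1.0, 3.0]
y_high = [1.5 * low_fidelity(x) for x in xs]   # discrepancy scales with the response
predict_high = proportional_correction(xs, y_high, low_fidelity)
```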
7.1.4 Ghoreyshi Cokriging

Ghoreyshi  cokriging  is  conceptually  less  intuitive  than  the  two  previous  methods,  but  equally 
easy to implement. As described by Ghoreyshi et al.,[63] a Ghoreyshi cokriging model is almost
identical to a standard Kriging model with the exception that the low-fidelity response value is
treated like an extra input variable to the surrogate model for the high-fidelity response.

For example, consider a problem with two input variables, $x_1$ and $x_2$. The matrix of inputs $F$ for a
single-fidelity Kriging model with a linear underlying trend would be expressed as:


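$$ F = \begin{bmatrix}
1 & x_{1,1} & x_{2,1} \\
1 & x_{1,2} & x_{2,2} \\
1 & x_{1,3} & x_{2,3} \\
1 & x_{1,4} & x_{2,4} \\
1 & x_{1,5} & x_{2,5}
\end{bmatrix} \tag{25} $$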


In contrast, the matrix of inputs $F$ for a Ghoreyshi cokriging model with a linear underlying
trend would take the form:

$$ F = \begin{bmatrix}
1 & x_{1,1} & x_{2,1} & Y_{low,1} \\
1 & x_{1,2} & x_{2,2} & Y_{low,2} \\
1 & x_{1,3} & x_{2,3} & Y_{low,3} \\
1 & x_{1,4} & x_{2,4} & Y_{low,4} \\
1 & x_{1,5} & x_{2,5} & Y_{low,5}
\end{bmatrix} \tag{26} $$

$Y_{low,i}$ represents the low-fidelity response for point $i$. Thus, if the problem of interest has $k$
input dimensions, the Ghoreyshi cokriging model will have $k+1$ input dimensions. The process
of  fitting  the  Kriging  model  (or  any  other  form  of  surrogate  model)  is  unchanged. 
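
The augmentation shown in Equations 25 and 26 amounts to appending one column. A small sketch, with a hypothetical function name and data:

```python
def trend_rows(X, y_low=None):
    """Rows [1, x_1, ..., x_k] of a linear-trend matrix, optionally with Y_low appended."""
    if y_low is not None and len(y_low) != len(X):
        raise ValueError("one low-fidelity value is required per point")
    rows = [[1.0] + list(x) for x in X]
    if y_low is not None:
        for row, value in zip(rows, y_low):
            row.append(value)          # low-fidelity response treated as an extra input
    return rows

# Hypothetical two-input problem with two sample points.
X = [[0.1, 0.2], [0.3, 0.4]]
F_single = trend_rows(X)                        # form of Equation 25
F_cokriging = trend_rows(X, y_low=[1.5, 2.5])   # form of Equation 26
```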

7.1.5  Data  Harmonization 

Data  harmonization  is  unique  among  these  methods  for  a  number  of  reasons.  First,  while  the 
other three methods fit separate surrogate models to each type of data (e.g., one surrogate model is
fit to the low-fidelity data, and then a second surrogate is fit to the discrepancy between the
low- and high-fidelity data), data harmonization accounts for all data simultaneously. This
method is therefore more vulnerable to the increasing modeling costs of Kriging for large data
sets. This vulnerability stems from its conceptual heritage, being intended to capture the
behavior of a single response as described by multiple observers.

Secondly, data harmonization has a different conceptual objective. The other three methods, by
fitting each source of data separately, are explicitly attempting to model the response as
understood by that source of data. Data harmonization, on the other hand, has some flexibility
in this regard. The pre-analysis effort includes the subtraction of any known effects, as seen in
Equation 32. In the gamma radiation example by Baume et al., the effect of elevation on the
radiation measurements is removed from the data using an analytical relation before the Kriging
model is trained. [14] Because this effect has been removed from the data, elevation is not
included  as  a  dimension  in  the  Kriging  model.  It  falls  to  the  user  to  recognize  any  predictable 
effects  and  account  for  them. 

As in universal Kriging, the response at some point, $Z(s)$, is taken to be the sum of a mean
function $m$ and a stochastic residual $\delta$:

$$ Z(s) = m(s) + \delta(s) \tag{27} $$

The  notation  in  this  section  is  based  on  the  work  of  Baume  et  al.[14]  The  mean  function  m  is 
considered  to  be  a  combination  of  known  effects  and  effects  which  must  be  estimated: 

$$ m(s) = F_a(s)\,a + F_\alpha(s)\,\alpha \tag{28} $$

Here, $F_a$ represents the components for which the coefficients, $a$, are known, while $F_\alpha$
represents the components for which the coefficients, $\alpha$, must be estimated. Categorical
variables, such as country code in the demonstration given by Baume et al., are handled by
adding binary column vectors to the set of input parameter values. [13, 15] For example, if the
country code can take one of three values, three binary column vectors are introduced. Cases for
which the country code takes the first value will have a one in the first binary column and zeros
in the other two, and so on. Bias values that correspond to each categorical variable value may
be known in advance, appearing in the model as part of $F_a(s)\,a$, or they may be estimated as part
of the model fitting process, appearing as part of $F_\alpha(s)\,\alpha$.
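
The binary-column encoding of a categorical variable can be sketched as follows; the function name and the three-valued country code are hypothetical.

```python
def binary_columns(labels, categories):
    """One binary column per category: 1 where the case has that value, else 0."""
    unknown = set(labels) - set(categories)
    if unknown:
        raise ValueError(f"unlisted category values: {sorted(unknown)}")
    return [[1 if label == c else 0 for c in categories] for label in labels]

# Hypothetical three-valued country code for four cases.
codes = ["A", "C", "B", "A"]
columns = binary_columns(codes, categories=["A", "B", "C"])
```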

Baume  et  al.[13]  noted  that  the  use  of  binary  vectors  for  categorical  variables  results  in  an 
over-determined  problem,  which  here  would  take  the  form  of  an  ill-conditioned  matrix  of 
basis functions during the creation of a Kriging model (written as $F$ in Equation 2 in Section
6.2, not to be confused with $F_a$ or $F_\alpha$ above). Two solutions are described by Baume et al.:
either  the  least-squares  coefficients  for  the  binary  vectors  are  subject  to  the  additional  constraint 
that  they  must  sum  to  zero,  or  one  bias  is  assumed  to  be  zero  and  omitted  from  the  model.  This 
latter  option  is  enacted  by  eliminating  the  binary  column  corresponding  to  the  bias  in  question 
from  the  matrix  of  basis  functions,  F. 
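
A short sketch of why the full set of binary columns is redundant, assuming the trend model also contains a constant term, and of the second remedy (dropping one column). The names and data are hypothetical.

```python
def drop_column(rows, index):
    """Remove one column, i.e. assume the bias for that category is zero."""
    return [[v for k, v in enumerate(row) if k != index] for row in rows]

# Binary columns for a three-valued categorical variable (hypothetical data).
binary = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

# Each row holds exactly one 1, so the binary columns sum to the constant
# column of the trend model; the basis columns are then linearly dependent.
redundant = all(sum(row) == 1 for row in binary)

# Omitting the first category's column removes the dependence.
reduced = drop_column(binary, index=0)
```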

Note  that  the  use  of  a  binary  vector  for  each  option  in  a  category  is  equivalent  to  making  the 
assumption  that  data  from  each  source  will  have  a  constant  bias  error.  More  complex 
relationships  between  data  sources  can  be  described  if  more  basis  functions  are  included  for 
each  categorical  variable.  A  constant-bias  data  harmonization  model  for  m  parameters  and  p 
data  sources  would  add  p  columns;  a  linear-bias  data  harmonization  model,  in  which  the 
discrepancy  between  data  sources  is  a  linear  function  of  the  input  dimensions,  would  add 
$(m + 1) \times p$ columns to the Kriging model.
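
As a worked instance of these counts, take $m = 2$ input parameters and $p = 2$ data sources (values chosen here only for illustration):

```latex
\text{constant bias: } p = 2 \text{ columns}, \qquad
\text{linear bias: } (m + 1) \times p = (2 + 1) \times 2 = 6 \text{ columns}
```

If the bias of one source is assumed to be zero and its columns are omitted, as in the second remedy described above, the counts would fall to $p - 1 = 1$ and $(m + 1)(p - 1) = 3$ columns respectively.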

Returning to Equation 27, the stochastic residual $\delta$ accounts for any variations from the mean
trend.  It  is  modeled  as  a  random  process  with  a  mean  of  zero  and  a  user-specified 
covariance  function.  [170]  Often,  the  covariance  function  includes  free  parameters  such  as 
dimensional  weights;  we  again  assume  that  most  or  all  of  these  parameters  are  unknown  and 
must be estimated while fitting the surrogate model.

Baume et al. then go beyond the traditional Kriging formulation that is used for most
applications by addressing the idea of uncertain data. Rather than assuming that response values
are known exactly, the observations are treated as a combination of the true response at each
point, $Z(s)$, and some measurement error term $\varepsilon(s)$:

$$ Y(s) = Z(s) + \varepsilon(s) \tag{29} $$

This  measurement  error  can  be  split  into  known  biases,  unknown  biases  (which  must  be 
estimated),  and  random  measurement  error  or  noise.  These  biases  correspond  to  discrepancies 
between  data  sources.  For  our  purposes,  this  can  be  written  as: 

$$ \varepsilon(s) = G_b(s)\,b + G_\beta(s)\,\beta + \xi(s) \tag{30} $$

$G_b(s)$ is a vector of biases for which the coefficients, $b$, are known and fixed. $G_\beta(s)$ is a
vector of measurement biases at location $s$ for which the coefficients, $\beta$, must be estimated,
while $\xi(s)$ is a random process representing the measurement error, which is assumed to have
zero mean.

Lastly, Baume et al. assume that the stochastic residual $\delta$ and random measurement error $\xi$
are normally distributed and mutually uncorrelated. The covariance matrices of $\delta$ and $\xi$ for
the given set of observations are denoted as

$$ V = \mathrm{Var}(\delta), \qquad W = \mathrm{Var}(\xi) \tag{31} $$

$W$ is a diagonal covariance matrix of random measurement errors. Baume et al. suggest that $W$ be
set by repeating a measurement multiple times using the device or method in question.
Alternatively, expert knowledge may be used to set one or more values in $W$. If the likely noise
behavior of data points is not uniform for all data points, each entry in $W$ may have a different
value to reflect the noise at that point. Essentially, $W$ acts as a matrix of nugget values for each
data  point. 
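
One plausible reading of the repeated-measurement suggestion is to use the sample variance of the repeats at each point as the corresponding diagonal entry of $W$. The sketch below takes that reading as an assumption; it is not necessarily the procedure of Baume et al., and the names and data are hypothetical.

```python
from statistics import variance

def noise_variances(repeated_measurements):
    """Sample variance of the repeats at each data point (diagonal entries of W)."""
    return [variance(repeats) for repeats in repeated_measurements]

def diagonal_matrix(entries):
    """Square matrix with `entries` on the diagonal and zeros elsewhere."""
    n = len(entries)
    return [[entries[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical repeated measurements at three data points.
repeats = [[1.0, 3.0], [2.0, 2.0, 2.0], [0.0, 1.0, 2.0]]
W = diagonal_matrix(noise_variances(repeats))
```

Each point receives its own value, which matches the per-point nugget behavior described above.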

After  any  known  mean  or  bias  effects  have  been  subtracted  from  the  observations,  the  result  is 
the  expression: 


$$ U = Y - (F_a a + G_b b) = F_\alpha \alpha + G_\beta \beta + \delta + \xi \tag{32} $$

The two random processes can be combined into a single process, $\phi$, while the mean and bias
functions can be combined into a vector of functions to obtain:

$$ U = X\theta + \phi \tag{33} $$

where $X = [F_\alpha \; G_\beta]$ and $\theta = [\alpha^{T} \; \beta^{T}]^{T}$.

As a result, the expected value of the model (i.e., the Kriging mean) is given by $X\theta$ while the
variance is $V + W$. Fitting the model requires the inversion of a covariance matrix with
dimension  equal  to  the  total  number  of  data  points,  similar  to  cokriging  or  the  autoregressive 
model  of  Kennedy  &  O’Hagan.  As  was  the  case  with  those  methods,  fitting  the  model  can 
become  expensive  or  infeasible  for  large  data  sets. 
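
For orientation, if $V$ and $W$ were known, the coefficients in Equation 33 could be estimated with the standard generalized least-squares form below. This is the textbook estimator, stated here as an assumption rather than quoted from Baume et al.; in practice the covariance parameters must be estimated as well.

```latex
\hat{\theta} = \left( X^{T} (V + W)^{-1} X \right)^{-1} X^{T} (V + W)^{-1} U
```

The factor $(V + W)^{-1}$ is the inversion referred to above, and its size grows with the total number of data points from all sources.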

As  mentioned  earlier,  Baume  et  al.  demonstrated  their  model  using  gamma  ray  dose 
measurements  from  multiple  European  countries.  The  elevation  of  the  sample  location  was 
treated  as  a  known  bias  using  an  analytical  relation.  Soil  composition  served  as  the  unknown 
effect in $F_\alpha$, and a country code was introduced as a bias term, $G_\beta$, to capture variations between
sets  of  data  provided  by  the  contributing  nations.  The  results  of  data  harmonization  demonstrated 
better  agreement  across  national  boundaries  compared  to  regular  Kriging,  as  well  as  increased 
prediction  confidence. 

Baume  et  al.[13]  suggest  two  possible  ways  to  construct  the  data  harmonization  model  in  order 
to  avoid  an  over-determined  system  of  equations:  either  an  additional  constraint  can  be  added 
to  the  least-squares  bias  estimates  so  that  all  biases  sum  to  zero,  or  one  of  the  biases  can  be 
assumed  to  be  zero.  The  latter  approach  removes  one  of  the  binary  columns  from  the  matrix  of 
samples.  For  the  problem  at  hand,  the  source  of  each  data  point  was  thus  represented  in  a 
single binary column, which took a one if the case in question was from Cart3D or a zero if
the case was from APAS. This was equivalent to assuming that the Cart3D data had no bias,
which was reasonable when the surrogate would be used to estimate Cart3D results.

As mentioned earlier, the single-binary-column formulation of data harmonization is equivalent
to an assumption that the discrepancy between APAS and Cart3D is best represented by a
constant offset. More complex representations of the discrepancy are possible, albeit at the
expense of a larger number of supplemental columns. Consider a two-dimensional problem
with inputs $x_1$ and $x_2$ and two data sources $S_1$ and $S_2$. If a linear trend is used and the discrepancy
between the two data sources is treated as a constant value, fitting a Kriging model will require
least-squares calculations for four coefficients: one for the overall mean, one to weight $x_1$, one to
weight $x_2$, and one to weight the binary column which captures the discrepancy between the
two data sources. The matrix of basis functions would look like the following:


$$ \begin{bmatrix}
1 & x_{1,1} & x_{2,1} & b_1 \\
1 & x_{1,2} & x_{2,2} & b_2 \\
1 & x_{1,3} & x_{2,3} & b_3 \\
1 & x_{1,4} & x_{2,4} & b_4 \\
1 & x_{1,5} & x_{2,5} & b_5
\end{bmatrix} $$

The $b$ column contains the binary entries which indicate whether a given row is from data source
$S_1$ or $S_2$.

If instead the discrepancy is considered to be linearly dependent on $x_1$ and $x_2$, the least-squares
calculations must estimate values for six coefficients: one for the overall mean, one to weight
$x_1$, one to weight $x_2$, and three to weight the discrepancy columns, which now take the following
form:

$$ \begin{bmatrix}
1 & x_{1,1} & x_{2,1} & b_1 & b_1 x_{1,1} & b_1 x_{2,1} \\
1 & x_{1,2} & x_{2,2} & b_2 & b_2 x_{1,2} & b_2 x_{2,2} \\
1 & x_{1,3} & x_{2,3} & b_3 & b_3 x_{1,3} & b_3 x_{2,3} \\
1 & x_{1,4} & x_{2,4} & b_4 & b_4 x_{1,4} & b_4 x_{2,4} \\
1 & x_{1,5} & x_{2,5} & b_5 & b_5 x_{1,5} & b_5 x_{2,5}
\end{bmatrix} $$

One  extra  column  is  added  for  each  dimension  that  will  be  incorporated  into  the  discrepancy 
calculations. Recall that the bias for one data source was assumed to be 0; cases from this
source will have $b_i = 0$. As a result, cases from this source will have only zeros in the extra
data harmonization columns:

$$ \begin{bmatrix}
1 & x_{1,1} & x_{2,1} & 0 & 0 & 0 \\
1 & x_{1,2} & x_{2,2} & 0 & 0 & 0 \\
1 & x_{1,3} & x_{2,3} & 0 & 0 & 0 \\
1 & x_{1,4} & x_{2,4} & 1 & x_{1,4} & x_{2,4} \\
1 & x_{1,5} & x_{2,5} & 1 & x_{1,5} & x_{2,5}
\end{bmatrix} $$
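
The rows of the matrices above can be generated mechanically. The sketch below builds one row per case for either the constant-offset or the linear-discrepancy form; the function name and the numerical inputs are hypothetical.

```python
def harmonization_row(x, b, linear_bias=True):
    """One row [1, x_1..x_m, b, b*x_1..b*x_m] of the basis-function matrix."""
    row = [1.0] + [float(v) for v in x]        # overall mean and linear trend terms
    row.append(float(b))                       # constant-offset (binary) column
    if linear_bias:
        row.extend(float(b) * v for v in x)    # discrepancy varies linearly with the inputs
    return row

# Hypothetical inputs: three cases from the zero-bias source (b = 0), two from the other (b = 1).
cases = [([0.1, 0.9], 0), ([0.4, 0.5], 0), ([0.7, 0.2], 0),
         ([0.3, 0.6], 1), ([0.8, 0.1], 1)]
F = [harmonization_row(x, b) for x, b in cases]
```

Rows with b = 0 carry zeros in every discrepancy column, while rows with b = 1 repeat the trend terms there.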

For  comparison,  when  fitting  an  additive  correction  model,  the  first  step  is  to  model  the  data 
from  the  low-fidelity  source.  For  this  model,  the  matrix  of  basis  functions  is  as  follows: 


$$ \begin{bmatrix}
1 & x_{1,1} & x_{2,1} \\
1 & x_{1,2} & x_{2,2} \\
1 & x_{1,3} & x_{2,3}
\end{bmatrix} $$

Once  this  model  is  available,  its  predicted  values  are  subtracted  from  the  response  values  from 
the  high-fidelity  source,  and  another  model  is  trained  to  capture  the  linear  trend  in  the  high- 
fidelity  data  using  a  similar  matrix  of  basis  functions: 

$$ \begin{bmatrix}
1 & x_{1,4} & x_{2,4} \\
1 & x_{1,5} & x_{2,5}
\end{bmatrix} $$


These  matrices  of  basis  functions  are  used  when  estimating  the  various  model  parameters,  such 
as the coefficients of the underlying trend model and the parameters of the correlation function
(e.g., $\theta$ in Equation 3). If the model parameters for the low-fidelity model and the correcting
model were fit simultaneously, the matrices of basis functions would be combined:

$$ \left[ \begin{array}{ccc|ccc}
1 & x_{1,1} & x_{2,1} & 0 & 0 & 0 \\
1 & x_{1,2} & x_{2,2} & 0 & 0 & 0 \\
1 & x_{1,3} & x_{2,3} & 0 & 0 & 0 \\
\hline
1 & x_{1,4} & x_{2,4} & 1 & x_{1,4} & x_{2,4} \\
1 & x_{1,5} & x_{2,5} & 1 & x_{1,5} & x_{2,5}
\end{array} \right] $$

The  terms  in  the  upper-left  correspond  to  a  surrogate  model  that  fits  the  low-fidelity  data  only, 
hence  the  zeros  in  the  upper-right:  high-fidelity  data  does  not  directly  affect  this  process.  Once 
these  coefficients  have  been  fit,  the  terms  in  the  lower  left  capture  the  predicted  low-fidelity 
results  at  the  high-fidelity  data  sites.  The  discrepancies  between  the  predicted  low-fidelity 
results  and  the  high-fidelity  results  are  then  modeled  using  the  terms  in  the  bottom  right. 

It now becomes clearer that when data harmonization is extended to capture more complex
discrepancy behavior, the method begins to resemble additive correction. The two methods
differ in that additive correction fits a model to one data source first (i.e., fitting a model
only to the rows where h = 0) before fitting a separate model to the discrepancy values. In
contrast, data harmonization fits all model parameters simultaneously, and thus must take all
available data points into account at the same time. Huang et al.[83] and Kennedy &
O’Hagan[95] both address the question of whether to fit these parameters simultaneously or
separately; both conclude that fitting two separate models is significantly easier and that very
little is lost in this simplification. In effect, additive correction is equivalent to a linear data
harmonization model, except that data harmonization fits all aspects of the surrogate
simultaneously while additive correction fits them separately.
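The sequential fitting used by additive correction can be sketched in one dimension. This is a minimal illustration using ordinary least-squares lines in place of Kriging models; the data values and the names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical data: low-fidelity source y = 1 + 2x, high-fidelity source y = 1.5 + 2.5x.
x_lo = [0.0, 1.0, 2.0, 3.0]
y_lo = [1.0, 3.0, 5.0, 7.0]
x_hi = [0.5, 2.5]
y_hi = [2.75, 7.75]

# Step 1: model the low-fidelity data alone (the rows where h = 0).
a_lo, b_lo = fit_line(x_lo, y_lo)

# Step 2: subtract the low-fidelity predictions at the high-fidelity sites
# and fit a second model to the remaining discrepancy.
residuals = [y - (a_lo + b_lo * x) for x, y in zip(x_hi, y_hi)]
a_d, b_d = fit_line(x_hi, residuals)

def predict(x):
    """Corrected prediction: low-fidelity trend plus fitted discrepancy."""
    return (a_lo + b_lo * x) + (a_d + b_d * x)

print(predict(1.0))
```

A simultaneous fit would instead estimate all four coefficients in a single solve over the combined matrix of basis functions.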

Data  harmonization  also  explicitly  captures  uncertainty  in  the  data  by  using  nuggets.  The 
modifications  to  DACE  toolbox  functions  that  were  necessary  to  implement  data  harmonization 
would  allow  any  Kriging  model  to  capture  uncertainty  via  nuggets;  thus,  once  nuggets  had 
been  implemented  for  data  harmonization,  they  were  available  for  use  by  every  other  modeling 
approach. 

Data harmonization is still unique among these methods in that it handles all sources of data
simultaneously. A trial application of the method was developed in order to gain familiarity with
its behavior. For the sake of simplicity, this trial was based on the two-dimensional data set
that was used for early contour-based sampling tests in Section 6.9.

7.2  Two-Dimensional  Test 

The data set for the pitching moment coefficient at Mach 0.3, α = 15° was used because the
response had low levels of noise, but was still difficult to model using only Cart3D data and
a linear or quadratic underlying trend. A response with linear or quadratic behavior could be
modeled with relatively few samples; more complex behavior would require more data to
improve the prediction accuracy of a surrogate model.

7.2.1 Selecting Samples

Relatively few high-fidelity data points would be used in this test in order to determine what
improvement, if any, a multi-fidelity approach (and in particular data harmonization) could offer
over the single-fidelity approach. For the results presented in this section, 6 high-fidelity
samples were used; this test was repeated for as few as 4 and as many as 9 high-fidelity samples
without qualitative changes in the results.

Those 6 high-fidelity samples would be selected using the Matlab function lhsdesign, part of the
Statistics Toolbox.[120] This function generates a Latin hypercube for a user-specified number
of samples and dimensions. These hypercube cases would be mapped to the design space. Recall
that the space had been sampled with a 50 × 50 grid of Cart3D samples; the hypercube cases
were replaced with the nearest sampled Cart3D case. Although this does affect the distribution
of samples, the sampling resolution ensured that the replacement cases would differ from the
hypercube cases by no more than 1 percent of the range of each variable. This was considered a
negligible discrepancy.
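The selection step can be sketched as follows. This is a simplified stand-in for the Matlab routine, not a reproduction of its algorithm: it assumes a design space normalized to the unit square and a 50 × 50 lattice that spans each variable's full range. The names `latin_hypercube` and `snap_to_grid` are hypothetical.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum per dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

def snap_to_grid(point, n_grid):
    """Replace a point with the nearest node of an n_grid x n_grid lattice on [0, 1]."""
    return tuple(round(u * (n_grid - 1)) / (n_grid - 1) for u in point)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
design = latin_hypercube(6, 2, rng)
snapped = [snap_to_grid(p, 50) for p in design]

# Largest per-variable shift caused by moving to the nearest grid node.
max_shift = max(abs(a - b) for p, q in zip(design, snapped) for a, b in zip(p, q))
print(snapped)
print(max_shift)
```

Under these assumptions the shift in any one variable cannot exceed half the grid spacing, 1/(2 × 49) of the range, which is about 1 percent.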

Low-fidelity samples were selected from the available APAS results, which were also generated
via a 50 × 50 grid of samples. The ratio of low-fidelity samples to high-fidelity samples was
varied from 1 to 15, and so the total number of low-fidelity samples for the results depicted
here ranged from 6 to 90.

The distribution of Latin hypercube samples is to some degree random, so the sampling process
was repeated multiple times for each ratio value. Because the number of high-fidelity samples
would be a constant in this test, the single-fidelity results were used to determine the number of
repetitions necessary for the average results to converge. The single-fidelity results did not
depend on the low-fidelity samples, so the number of repetitions was increased until the average
prediction accuracy of the single-fidelity models remained essentially constant when the number
of low-fidelity samples was varied. For the results presented here, the sampling was repeated
700 times for each data pool size.

7.2.2  Modeling  Approaches 

The baseline approach was to train the surrogate using only Cart3D samples, as was done in
the  motivating  study.  The  assertion  under  consideration  was  that  if  a  cheaper  source  of  data 
such  as  APAS  could  be  incorporated,  the  resulting  surrogate  models  would  have  better  predictive 
accuracy. 

Although efforts were undertaken to ensure that the configurations analyzed by Cart3D and
APAS were as similar as possible, certain aspects of the vehicle geometry were not represented
in the APAS model. Whereas in the PaceLab environment (and consequently in Cart3D) the
vertical  tails  at  the  wingtips  could  be  oriented  in  three  dimensions  using  a  cant  angle  and 
toe-in  angle,  the  vertical  tails  described  to  APAS  were  held  to  be  vertical  and  aligned  with  the 
vehicle  x-axis.  Control  surfaces  generated  by  PaceLab  could  not  be  accurately  reproduced  for 
APAS  due  to  the  low  granularity  of  the  cross-sections  used  by  the  latter  tool;  the  smallest 
control  surfaces  that  could  be  represented  in  APAS  might  be  many  times  larger  than  the  same 
surfaces  as  defined  in  PaceLab.  In  light  of  this,  control  surface  deflections  were  omitted  entirely. 
Lastly, the body flap at the rear of the fuselage was also omitted.

The low-fidelity samples from APAS, although much less expensive than high-fidelity samples,
might be considered something of an unfair “head start” for the multi-fidelity methods; the
multi-fidelity methods will benefit from the extra computational effort, however slight, that
went into generating the low-fidelity data. To address this, an additional single-fidelity
approach was tested where the number of high-fidelity samples was increased from 6 to 7.
Each high-fidelity analysis took on the order of 2,000 times the computational effort of the
corresponding low-fidelity analysis, far more than the maximum amount of low-fidelity data
used by the multi-fidelity methods, so one extra high-fidelity sample was more than equal to
the extra computational investment in the low-fidelity samples.
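The cost argument can be checked with the figures quoted in this section, taking the 2,000:1 cost ratio at face value and using the largest low-fidelity pool of 90 samples:

```latex
\frac{90 \ \text{low-fidelity analyses}}
     {2{,}000 \ \text{low-fidelity analyses per high-fidelity analysis}}
  = 0.045 \ \text{high-fidelity analyses} \ll 1
```

Even the largest low-fidelity pool therefore costs a small fraction of the one additional high-fidelity sample granted to the single-fidelity baseline.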

Data harmonization was the most interesting modeling approach in this test, as it appeared in
the literature only in the original author’s publications. There was some uncertainty as to what
(if any) “noise” should be captured. The model was intended to reproduce Cart3D data as
precisely as possible and the iteration noise was negligible relative to the magnitude of the
response, so the high-fidelity cases were considered to have zero noise. The low-fidelity data
source was deterministic, with no obvious uncertainty in its results.

The low-fidelity samples would easily outnumber the high-fidelity samples, which led to some
concern that the low-fidelity data might skew the data harmonization model in favor of
matching APAS data rather than Cart3D. To test this, two data harmonization models were
created. The first treated both the high- and low-fidelity samples as deterministic, with no
noise or uncertainty. The second applied a constant nugget, equal to 5 percent of the range of
the observed low-fidelity response values, to all low-fidelity cases. This meant that the data
harmonization model would interpolate every high-fidelity sample exactly without having to do
the same for the low-fidelity samples.

Additive  correction,  as  arguably  the  most  popular  multi-fidelity  method  described  in  the 
literature,  would  also  be  applied  as  a  further  point  of  comparison  to  determine  the  effectiveness 
of  data  harmonization.  The  predictive  accuracy  of  each  modeling  approach  would  be  evaluated 
using  Root  Mean  Squared  Error,  as  defined  in  Equation  15. 
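For reference, a minimal sketch of the error metric follows. It assumes the standard definition of Root Mean Squared Error (the square root of the mean squared difference between predicted and reference values); Equation 15 is not reproduced here, and the sample values are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical surrogate predictions and reference values at three test points.
error = rmse([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(error)
```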

7.2.3  Evaluating  the  Results 

The results of this study are depicted in Figure 26. The triangular icons pointing down and
those pointing to the right, which represented the single-fidelity models with and without an
extra case respectively, maintained a consistent RMSE value as the number of low-fidelity
samples was varied. This served as confirmation that sufficient repetitions were performed and
the results were adequate representations of the average performance of each method.

Figure 26: Prediction RMSE for All Multi-Fidelity Methods at Mach 3, AoA 15°

The  two  data  harmonization  approaches  -  one  including  a  nugget  for  low-fidelity  data  and  one 
treating  all  data  as  deterministic  -  track  each  other  closely,  which  indicates  that  the  low- 
fidelity  nugget  did  not  significantly  affect  prediction  accuracy  for  high-fidelity  response.  In 
fact,  the  data  harmonization  models  with  noise  often  performed  slightly  worse  than  the 
noiseless  models,  which  suggested  that  the  quantity  of  noise  included  (equal  to  5  percent  of  the 
observed  range  of  the  low-fidelity  data)  was  excessive  for  this  problem,  acting  to  “wash  out” 
useful  information  rather  than  downplaying  misleading  effects. 

The data harmonization models demonstrated improved predictive accuracy compared to the
single-fidelity methods; the prediction error as quantified by RMSE was reduced by 10-20
percent when sufficient low-fidelity data was available. However, data harmonization was
outperformed by the additive correction model, which tended to offer roughly twice as much
improvement in prediction accuracy.

These  results  confirmed  that  multi-fidelity  modeling  could  enhance  prediction  accuracy  relative 
to  a  single-fidelity  approach.  Although  additive  correction  out-performed  data  harmonization  in 
this  case,  the  relatively  simple  nature  of  the  test  did  not  allow  either  method  to  be  declared  “best” 
for  the  large-scale  problem.  Additional  testing  with  a  more  complex  data  set  was  required. 
Furthermore,  there  still  existed  some  question  as  to  which  sources  of  uncertainty  were  useful  to 
capture  and  which  could  be  neglected.  The  next  section  is  a  review  of  the  sources  of  uncertainty 
under  consideration. 

7.3  Selecting  Sources  of  Uncertainty 

Numerous uncertainty sources have been identified and detailed in the literature.[43, 95, 186,
202] A full description of every source of uncertainty in computational analysis is far beyond
the scope of this work. Entire books could be written on the subject of estimating the
uncertainty in a computational result - and have been.[138] Instead, a few uncertainty sources
were identified as being of particular interest based on observations made during the research
effort described in Section 4. These sources are:

•  Uncertainty resulting from finite precision and discretization during analysis, such as
measurement or numerical iteration;[163]

•  Uncertainty resulting from the use of a surrogate model rather than the original analysis,
i.e., imperfect emulation;[95] and,

•  Uncertainty due to use of a data source which does not perfectly emulate the response,
i.e., imperfect fidelity.[44]

The first source of uncertainty, discretization and iteration effects, was directly observed during
the motivating effort (see Figure 7 on page 42). Cart3D uses an iterative simulation process to
converge to a solution. Responses such as the force and moment coefficients were calculated for
every iteration step and written to an output file, and thus the convergence history of each
response can be investigated. It was found that most cases exhibited iteration noise on the
rough order of 0.001 to 0.01. This could be considered negligible relative to the observed
pitching moment coefficients (which were often on the rough order of 1 or 10), but the same
noise was significant relative to the observed rolling and yawing moment coefficients (which
were on the rough order of 0.001 to 0.1). It is plausible that capturing this iteration noise
during the training of surrogate models might improve prediction accuracy for lateral
responses.[109]

The second source of uncertainty, imperfect emulation, is introduced when a data source is
replaced by a surrogate model. It was mentioned in Section 3.5.3 that, despite the relatively rapid
execution time of APAS, a second or two per case is still too long when thousands of candidate
and test points must be evaluated by the contour-based sampling algorithm before a sample can
be selected. A surrogate model could estimate low-fidelity response values at those points more
quickly, significantly reducing the time required to select each sample, but would introduce
additional uncertainty due to small prediction errors.[202] If this surrogate is Kriging-based,
the prediction uncertainty can be calculated analytically; if another surrogate model type is used,
the standard deviation of the Model Representation Error can be used to calculate the prediction
variance. Model Representation Error is calculated as part of the “Goodness of Fit” tests.[134]

The third source of uncertainty, imperfect model fidelity, is introduced whenever modeling is
used. This occurs for all experiments short of full-scale testing under operational conditions,
as evidenced by the Lockheed C-141 Starlifter mentioned in Section 3.6.[23] Although that
effort included wind tunnel testing of the wing design at appropriate Mach numbers, there
was sufficient discrepancy between the viscous flow behavior over the subscale test article and
the full-scale vehicle that the designers were forced to modify the control schedules and
reinforce the wing and fuselage to avoid exceeding the structural limits of the design.

For this research effort, different surrogate models would be used at each flight condition. The
independent variables for each surrogate model would be geometric parameters, unlike other
efforts which fixed the geometric shape of the vehicle and varied the flight condition.[63]
Given the low probability that multiple sources of validation data will fall within any given
design space (described in Section 5.8.3), it was unlikely that enough information would be
available to describe how the uncertainty due to imperfect fidelity would vary with respect to
the free parameters. Instead, this uncertainty would be modeled as a uniform value for each
flight condition.

Two hurdles remained before uncertainty could be integrated into this modeling framework.
First, methods must be available to quantify or estimate the contribution to overall uncertainty
that stems from each source. Second, although uncertainty could be captured by a Kriging
model via the nugget parameter, and uncertainty present in Kriging model predictions could be
estimated via prediction variance, it had yet to be demonstrated that the uncertainty defined
via nugget is preserved and represented in variance estimates.

7.3.1  Quantification  of  Uncertainty 

Of the three sources of uncertainty considered in this research - uncertainty due to
discretization, uncertainty due to imperfect emulation, and uncertainty due to imperfect
fidelity - two could be quantified in a straightforward manner. The scripts used to extract
results from Cart3D output files were modified to extract response values over the final 30 solver
iterations. The average and standard deviation of the series were calculated for each response
and reported as part of the solution data set. This provided a quantitative assessment of the
uncertainty due to discretization.
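The extraction step can be sketched as below. The convergence history is synthetic, the function name `tail_statistics` is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption; the text does not state which estimator was used.

```python
import statistics

def tail_statistics(history, window=30):
    """Mean and sample standard deviation of the last `window` entries."""
    tail = history[-window:]
    if len(tail) < 2:
        raise ValueError("need at least two iterations")
    return statistics.mean(tail), statistics.stdev(tail)

# Synthetic convergence history: a decaying transient followed by
# a small oscillation of amplitude 0.002 about a converged value of 1.0.
history = [1.0 + 0.5 * 0.8 ** k for k in range(70)]
history += [1.0 + 0.002 * (-1) ** k for k in range(30)]

mean_value, noise = tail_statistics(history)
print(mean_value, noise)
```

The standard deviation returned here is the quantity that would be squared and supplied as the nugget for that case.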

Uncertainty  due  to  imperfect  emulation  is  estimated  once  a  surrogate  model  has  been  trained, 
during  the  Model  Representation  Error  test. [36]  This  test  uses  the  surrogate  model  to  predict 
the  response  at  a  set  of  test  points  for  which  the  true  response  is  known.  Thus,  if  a  surrogate 
model  has  been  subjected  to  Goodness  of  Fit  tests,  an  estimate  of  the  uncertainty  due  to 
imperfect  emulation  should  already  be  in  hand. 

Only uncertainty due to imperfect fidelity remained. As discussed in Section 5.8, the literature
indicates that this uncertainty is quantified through a process similar to validating a tool: the
analysis tool is applied to one or more analyses for which high-fidelity results are available.
Sponsors at the Air Force Research Laboratory provided wind tunnel data and surface meshes
for three variants of the XCOR Lynx suborbital vehicle.[127, 177] One variant of this vehicle is
displayed in Figure 27.


Figure 27: XCOR Lynx Suborbital Vehicle Variant

The data set included force and moment measurements taken at various flight conditions. The
reported data also included measurements of factors such as local pressure and temperature,
factors which are not of importance when running an inviscid tool such as Cart3D but would be
required information if this data set were used for validation of a viscous tool. If such data were
not recorded, the analyst would have to select a value for those parameters; values might be
chosen to maximize agreement with the experimental data, a process closer to calibration than
validation and one that is strongly advised against by validation experts.[140] Instead, the
analysis should be done in ignorance of the high-fidelity result values whenever possible to
avoid accidental or deliberate calibration.


The wind tunnel data set was parsed to match each surface mesh with the flight conditions for
which data was available. Mach number varied from 0.29 to 4.5 and angle of attack varied
from -1.7° to 43.5°. Sideslip angle was set to 0° for most cases, but 20 of the 71 parameter
sweeps fixed the Mach number and angle of attack and varied sideslip angle. The flight
conditions in the data set are plotted as angle of attack versus Mach number in Figure 28.
Some sweeps re-sampled a previous range; it is thought that these repetitions were intended to
estimate the variability in the results as recommended by Aeschliman & Oberkampf.[3] In all,
632 data points were available for the surface meshes that were obtained.

Figure 28: Flight Conditions for XCOR Data

Cart3D was used to analyze each vehicle variant at flight conditions which matched the wind
tunnel data. Note that some model comparison efforts prefer to match lift coefficient rather
than angle of attack.[102] Such an approach was rejected here due to the effort required to
match the observed lift coefficient for each of the 632 data points. Instead, the vehicle variants
were assessed using a once-through approach identical to that used for all other Cart3D analyses
in this effort. The reference wing area, mean aerodynamic chord, and wing span were obtained
by scaling the values recorded in the wind tunnel data set to match the dimensions of the
surface triangulations; the best match between the dimensions specified for the wind tunnel
model and the observed dimensions in the triangulation was a factor of 63, so the reference
lengths were multiplied by 63 and the reference area by 63². A center of gravity at 70 percent of
the vehicle’s length was assumed.

It should be noted that decisions on the part of the analyst concerning appropriate parameter
settings, such as grid resolution or turbulence model, can have a very significant effect on the
analysis results, at times even larger than the selection of the analysis tool itself.[80] Any
discrepancies between wind tunnel measurements and Cart3D calculations that are presented
in this document should not be construed as inherent limitations of Cart3D; it is quite
possible that the tool was applied in a sub-optimal manner by the author, and that better
agreement with the data might be achieved by applying Cart3D in a different manner.

With that caveat given, the Cart3D results were compared against the wind tunnel data. The
results presented here will emphasize pitching moment coefficient, as this was the response
that most significantly restricted the feasible design space in the RBS study. As in the X-33
uncertainty database, discrepancies between code prediction and experimental measurement
were plotted versus Mach number.[32] The discrepancies were calculated as:

$$
\delta = C_{\text{Cart3D}} - C_{\text{WindTunnel}}
$$

When  the  discrepancies  are  calculated  in  this  manner,  an  overly-positive  prediction  by  Cart3D 
results  in  a  positive  discrepancy.  These  discrepancies  are  plotted  in  Figure  29.  Figure  29a 
shows  the  data  for  Mach  numbers  less  than  0.6,  Figure  29b  shows  results  between  Mach  0.6 
and  1.35,  and  Figure  29c  shows  the  data  for  Mach  numbers  above  1.35. 
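The sign convention and the grouping by speed regime can be sketched as follows. The numerical values are invented for illustration, and the treatment of cases falling exactly on a boundary (Mach 0.6 or 1.35) is an assumption, since the text does not specify it.

```python
def discrepancy(cart3d_value, wind_tunnel_value):
    """Code prediction minus measurement; positive when the code over-predicts."""
    return cart3d_value - wind_tunnel_value

def mach_group(mach):
    """Assign a case to panel 'a', 'b', or 'c' using the 0.6 and 1.35 boundaries."""
    if mach < 0.6:
        return "a"
    if mach <= 1.35:
        return "b"
    return "c"

# Hypothetical (Mach, Cart3D, wind tunnel) triples; values are illustrative only.
cases = [(0.3, 0.12, 0.10), (0.9, -0.05, -0.02), (2.0, 0.40, 0.35)]

groups = {"a": [], "b": [], "c": []}
for mach, predicted, measured in cases:
    groups[mach_group(mach)].append(discrepancy(predicted, measured))
print(groups)
```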


Figure 29: Comparison of Cart3D Results to Wind Tunnel Data, Grouped by Mach Number
(panels a, b, and c; horizontal axis: Mach Number)

The  Mach  numbers  analyzed  are  clearly  identifiable,  but  with  respect  to  the  magnitude  of  the 
discrepancy,  the  icons  are  more  or  less  continuous.  This  may  not  be  surprising  given  the 
relatively  fine  resolution  of  the  experimental  plan  shown  in  Figure  28.  Although  the  X-33 
uncertainty  database  was  grouped  by  Mach  number,  for  these  results  that  grouping  did  not 
offer  any  insights  into  the  accuracy  of  Cart3D. 

No obvious trends were discernible when the discrepancies were plotted against angle of attack,
aside from a general upward trend with increasing angle of attack. It was found that after the
results were grouped by Mach number as in Figure 29, clear trends with respect to angle of
attack could be seen within each group. Those results are visible in Figure 30.

Figure 30: Comparison of Cart3D Results to Wind Tunnel Data, Grouped by AoA for (a) Mach < 0.6;
(b) 0.6 < Mach < 1.35; (c) Mach > 1.35

In  Figure  30a,  which  shows  results  for  Mach  <  0.6,  the  discrepancies  show  a  fairly  clear  trend; 
they  are  small  close  to  0°  angle  of  attack  and  grow  larger  as  the  angle  of  attack  increases  to 
20°.  This  trend  appears  to  reverse  itself  for  higher  angles,  but  since  only  one  sweep  was 
performed  for  angles  of  attack  greater  than  20°  in  this  speed  regime,  the  author  is  hesitant  to 
generalize  from  that  data. 

Figure  30b  shows  results  for  Mach  numbers  between  0.6  and  1.35.  Here  the  trend  resembles 
what  was  observed  in  Figure  30a,  with  discrepancies  increasing  for  larger  angles.  These  results 
exhibit  more  spread,  even  at  lower  angles  of  attack,  which  is  not  surprising  given  the  known 
limitations  of  Euler  methods  at  transonic  speeds.  [7] 

Finally,  Figure  30c  shows  the  results  for  Mach  numbers  greater  than  1.35.  At  these  speeds,  the 
discrepancies  at  10°  are  less  than  for  the  subsonic  data  set,  but  the  discrepancies  for  supersonic 
flight  conditions  steadily  grow  with  angle  of  attack.  For  a  given  angle  of  attack,  a  higher  Mach 
number  corresponds  to  a  smaller  discrepancy. 

In  every  speed  regime,  the  discrepancies  between  Cart3D  and  wind  tunnel  results  increased 
with angle of attack. This indicated that, for larger angles of attack, Cart3D would over-predict
the pitching moment by progressively larger amounts. It may be that the effective
center  of  mass  of  the  wind  tunnel  model  was  closer  to  the  nose  of  the  vehicle  than  the  point 
at  70  percent  of  body  length  that  was  used  for  Cart3D  calculations.  The  70  percent  position 
was  based  on  the  best  available  information;  barring  additional  information,  the  remainder  of 
this  work  will  assume  that  the  results  of  this  section  are  representative  of  the  discrepancy 
between  Cart3D  estimates  produced  by  the  author  and  experimental  data. 

These results could be used to estimate the likely bias and random errors in Cart3D predictions
that were due to model fidelity limitations. The bias error would be estimated as the average
prediction error. The predictions would then be corrected by this bias estimate, and the new
prediction errors calculated. The variance of the new prediction error could be used as an
estimate of the random error variance. These two estimates, along with the estimates for
uncertainty due to iteration noise and imperfect emulation, would enable the user to compute the
effects of all three uncertainty sources relevant to this effort.
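The bias and random error estimates described above can be sketched in a few lines. The error values are hypothetical, the function name is illustrative, and the sample (n - 1) variance is an assumption; the text does not say which estimator was intended.

```python
import statistics

def bias_and_random_error(errors):
    """Return (bias, variance of the bias-corrected errors)."""
    bias = statistics.mean(errors)
    # Correct every prediction error by the estimated bias.
    corrected = [e - bias for e in errors]
    # Sample variance; the corrected errors have zero mean by construction.
    random_variance = sum(c * c for c in corrected) / (len(corrected) - 1)
    return bias, random_variance

# Hypothetical prediction errors (code prediction minus measurement).
errors = [0.02, 0.05, 0.03, 0.06, 0.04]
bias, random_variance = bias_and_random_error(errors)
print(bias, random_variance)
```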
7.3.2 Categorizing Sources of Uncertainty

Note that not all the uncertainty sources have the same implications with respect to the data. The
first two sources, uncertainty due to discretization and uncertainty due to imperfect emulation,
pertain to the simulated response. They describe uncertainty with respect to the response at the
current level of fidelity (e.g., the response as calculated by APAS or Cart3D).

Uncertainty due to model fidelity limitations, on the other hand, describes the difference between
the response produced by the simulation - e.g., pitching moment as calculated by Cart3D - and
the response being simulated. By capturing the first two uncertainty sources, surrogate models
will be more accurate with regard to the training data. If the last uncertainty source can also be
captured effectively, the uncertainty predictions of the resulting surrogate model might be used
to estimate the accuracy of the surrogate compared to the data used to estimate model limitation
error.

Using  the  validation  results,  the  uncertainty  due  to  model  fidelity  limitations  can  be  estimated. 
The  estimated  values  can  be  added  to  the  nugget  when  training  the  Kriging  surrogate. 
However,  because  uncertainty  due  to  inadequate  fidelity  reflects  the  probable  discrepancy 
between  that  level  of  fidelity  and  the  “true”  value  of  the  response  being  simulated,  it  is 
imperative  that  this  quantity  be  preserved. 

Although  it  was  demonstrated  that  uncertainty  information  can  be  passed  into  a  Kriging  model 
via  the  nugget,  and  all  Kriging  models  are  capable  of  estimating  the  predictive  variance  for 
the  predictions  that  come  out  of  the  model,  it  had  not  yet  been  demonstrated  that  uncertainty 
information  is  preserved  during  this  process.  Preservation  of  this  information  is  required  if 
uncertainty  due  to  model  fidelity  limitations  is  to  be  captured  in  this  manner.  A  small-scale  test 
was  therefore  devised  to  determine  whether  this  information  is  preserved. 
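The two ingredients named above, a nugget on the way in and a prediction variance on the way out, can be sketched in a few lines. This is not the DACE toolbox implementation: it assumes zero-mean (simple) Kriging in one dimension with a Gaussian correlation function, and it fixes the correlation parameter `theta` and process variance `sigma2` rather than estimating them. All names and data values are hypothetical.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def corr(a, b, theta):
    """Gaussian correlation between two scalar sites."""
    return math.exp(-theta * (a - b) ** 2)

def krige(xs, ys, nuggets, x_new, theta=10.0, sigma2=1.0):
    """Zero-mean Kriging prediction and prediction variance at x_new."""
    n = len(xs)
    # Covariance of the observations: correlation term plus nugget on the diagonal.
    cov = [[sigma2 * corr(xs[i], xs[j], theta) + (nuggets[i] if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    k = [sigma2 * corr(x, x_new, theta) for x in xs]
    weights = solve(cov, k)  # cov is symmetric, so these are the Kriging weights
    mean = sum(w * y for w, y in zip(weights, ys))
    variance = sigma2 - sum(w * ki for w, ki in zip(weights, k))
    return mean, variance

xs = [0.0, 0.5, 1.0]
ys = [0.2, -0.1, 0.4]
mean_exact, var_exact = krige(xs, ys, [0.0, 0.0, 0.0], 0.5)
mean_noisy, var_noisy = krige(xs, ys, [0.04, 0.04, 0.04], 0.5)
print(mean_exact, var_exact)
print(mean_noisy, var_noisy)
```

With zero nuggets the sketch interpolates the training value at x = 0.5 and reports essentially zero variance there; with nonzero nuggets the reported variance at the same site is positive. Whether that variance matches the uncertainty supplied through the nugget is the question examined in Section 7.3.3.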

7.3.3  Preservation  of  Uncertainty  in  Kriging  Models 

A  set  of  “truth”  data  was  derived  from  wind  tunnel  tests  for  a  reusable  booster  configuration 
published  by  Post  et  al.[150]  The  16-case  data  set  corresponding  to  the  78  percent  center-of- 
gravity  location  was  selected.  This  selection  was  arbitrary.  The  exact  shape  of  the  vehicle  was 
not  available,  precluding  analysis  with  other  tools.  Instead,  a  notional  lower-fidelity  data  set 
was  created  using  only  the  points  with  small  angles  of  attack  where  the  relationship  between 
angle  of  attack  and  pitching  moment  was  approximately  linear.  The  rate  of  change  of  pitching 
moment  in  this  region  was  approximately  -0.01  per  degree.  Pamadi  et  al.[144]  showed  that  for 
the  Langley  Glide  Back  booster,  the  low-fidelity  tool  estimated  this  slope  fairly  well  but  failed 
to capture nonlinear effects at higher angles of attack. Bias offsets were also sometimes present
in  the  low-fidelity  data.  This  behavior  is  illustrated  in  Figure  6.  Here,  a  bias  offset  of  -0.04  was 
added  for  visual  clarity. 

The  uncertainty  due  to  simplifying  assumptions  -  in  this  case,  linear  aerodynamics  -  could 
not  be  estimated  in  a  general  form  based  on  this  information.  Instead,  a  very  rough  estimate 
of  uncertainty  was  made:  the  discrepancy  between  the  linear  results  and  the  wind  tunnel 
results was used as the standard deviation of this uncertainty. If the error introduced by use of a
linear method is normally distributed, the true result would fall within ±1 standard deviation, or
1σ, of the linear estimate around 68 percent of the time. A ±2σ range would encompass the true
result roughly 95 percent of the time. Figure 31 shows the relative distribution of wind tunnel
results, low-fidelity linear results, and uncertainty bounds around the low-fidelity data points.
The uncertainty bounds were interpolated for ease of representation. The uncertainty ranges,
particularly the ±2σ bounds, enclosed a broad swath of possible values. This reflected the
potentially large discrepancies present at higher angles of attack where the linear approximation
was the least accurate.
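
As a rough illustration of the bounds described above, the following sketch builds ±1σ and ±2σ bands around a linear low-fidelity estimate and checks the nominal coverage of each band under a normal error model. The slope of -0.01 per degree and the offset of -0.04 come from the text; the angle-of-attack grid and the "truth" values are made-up placeholders, not the wind tunnel data.

```python
import math
import numpy as np

def normal_coverage(k):
    """Probability that a normal variate lies within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))

def uncertainty_bands(estimate, sigma, k):
    """Lower and upper bounds at +/- k sigma around the estimate."""
    estimate = np.asarray(estimate, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return estimate - k * sigma, estimate + k * sigma

# Hypothetical angles of attack (degrees) and placeholder "truth" values.
alpha = np.array([0.0, 4.0, 8.0, 12.0, 16.0, 20.0])
truth = np.array([0.00, -0.04, -0.08, -0.10, -0.09, -0.05])  # illustrative only

# Linear low-fidelity estimate: slope from the text, plus the -0.04 offset.
low_fidelity = -0.01 * alpha - 0.04

# Rough estimate used in the text: the discrepancy itself serves as sigma.
sigma = np.abs(truth - low_fidelity)

lo1, hi1 = uncertainty_bands(low_fidelity, sigma, 1)
lo2, hi2 = uncertainty_bands(low_fidelity, sigma, 2)

print(f"1-sigma coverage: {normal_coverage(1):.4f}")
print(f"2-sigma coverage: {normal_coverage(2):.4f}")
```

Because the discrepancy is used directly as the standard deviation, each placeholder "truth" value sits on the edge of its own ±1σ band by construction, which is why the text calls this a very rough estimate.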


A  Kriging  model  was  trained  using  the  16  low-fidelity  data  points.  A  linear  underlying  trend 
was  used  in  the  Kriging  model,  and  the  standard  deviation  at  each  data  point  was  squared  to 
produce  variances.  These  variances  were  incorporated  into  the  Kriging  model  via  nuggets. 
This Kriging model was then used to estimate the response and prediction variance for 1,000
points interpolating the existing data set in order to assess how well the new surrogate could
reproduce the uncertainty that was described using nuggets. The prediction variance values
were converted back to standard deviation for plotting purposes. The data points, ±1σ
bounds, and ±2σ bounds are plotted in Figure 32, just as they were in Figure 31.

Figure 31: Data Points & Defined Uncertainty Ranges

It was immediately apparent that the uncertainty bounds that were obvious in Figure 31 were
not visible in Figure 32. The presence of the uncertainty ranges in the legend indicated that
the bounds were being plotted. A closer inspection of the black line indicating the predicted
response values, seen in Figure 33, held the answer. Figure 33 is a highly zoomed view of
Figure 32. The ±1σ and ±2σ bounds were present in Figure 32, but were multiple orders of
magnitude smaller than the uncertainty which was described using nuggets.

Figure 32: Data Points & Estimated Uncertainty Ranges

This behavior was explained by referring back to the mathematical formulation of Kriging given
in Section 6.2. In particular, note that the estimated process variance appears both
explicitly and implicitly in the equation for prediction variance estimation (Equation 4);
remember that the covariance function is the product of the process variance and the correlation
function. The estimated process variance (Equation 5) depends on the quantity $(y - F\beta)$,
which is the difference between the observed response and the value of the underlying trend at
that point.

From these equations it was inferred that, when the observed response behavior is approximated
well  by  the  trend,  the  estimated  process  variance  (and  by  extension  the  prediction  variance) 
would  be  close  to  zero,  no  matter  how  much  uncertainty  was  present  in  the  original  data  set. 

As  a  result,  uncertainty  information  that  is  fed  into  a  Kriging  model  may  not  be  preserved  in  the 
predictions  of  that  Kriging  model. 
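
The collapse of the prediction variance can be reproduced with a small numerical sketch. The code below assumes the common universal-Kriging formulas (generalized least squares trend, process variance estimated from the trend residual, nuggets added to the diagonal of the correlation matrix); these may differ in detail from Equations 4 and 5 in Section 6.2. The correlation parameter `theta`, the data values, and the nugget magnitudes are all hypothetical.

```python
import numpy as np

def gaussian_corr(xa, xb, theta):
    """Gaussian correlation between two sets of 1-D points."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-theta * d**2)

def fit_kriging(x, y, nugget, theta=0.1):
    """Kriging with a linear trend; nuggets are ASSUMED to sit on the diagonal of R."""
    n = len(x)
    F = np.column_stack([np.ones(n), x])               # linear trend basis
    R = gaussian_corr(x, x, theta) + np.diag(nugget)
    A = F.T @ np.linalg.solve(R, F)                    # F' R^-1 F
    beta = np.linalg.solve(A, F.T @ np.linalg.solve(R, y))
    resid = y - F @ beta                               # the quantity (y - F beta)
    sigma2 = float(resid @ np.linalg.solve(R, resid)) / n
    return {"x": x, "F": F, "R": R, "A": A, "theta": theta,
            "beta": beta, "sigma2": sigma2}

def prediction_std(model, x_new):
    """Square root of the Kriging prediction variance at new points."""
    r = gaussian_corr(model["x"], x_new, model["theta"])   # n x m
    f = np.vstack([np.ones(len(x_new)), x_new])            # 2 x m
    Ri_r = np.linalg.solve(model["R"], r)
    u = model["F"].T @ Ri_r - f
    scale = (1.0 - np.sum(r * Ri_r, axis=0)
             + np.sum(u * np.linalg.solve(model["A"], u), axis=0))
    return np.sqrt(model["sigma2"] * np.maximum(scale, 0.0))

alpha = np.linspace(0.0, 30.0, 16)        # 16 hypothetical angles of attack
input_std = 0.005 * alpha                 # hypothetical uncertainty, grows with alpha
nugget = input_std**2                     # standard deviations squared to variances

y_linear = -0.01 * alpha - 0.04           # response matched exactly by the trend
y_curved = y_linear + 0.0005 * alpha**2   # response the linear trend cannot match

alpha_new = np.linspace(0.0, 30.0, 1000)
model_linear = fit_kriging(alpha, y_linear, nugget)
model_curved = fit_kriging(alpha, y_curved, nugget)
std_linear = prediction_std(model_linear, alpha_new)
std_curved = prediction_std(model_curved, alpha_new)

print("largest input std:           ", input_std.max())
print("linear data, max predicted std:", std_linear.max())
print("curved data, max predicted std:", std_curved.max())
```

Under these assumptions the linear data set yields an estimated process variance at round-off level, so the predicted standard deviation is negligible next to the uncertainty supplied through the nuggets, while the curved data set yields a clearly nonzero process variance.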


Figure 33: Zoomed Region of Data Points & Estimated Uncertainty Ranges

This loss of information is of little concern when it comes to the effects of finite precision or
surrogate model imperfections; both of those uncertainties relate only to the quantity being
estimated, i.e., pitching moment as calculated by Cart3D (in this case). Uncertainty due to
model fidelity limitations, on the other hand, describes the possible discrepancies between the
quantity being simulated (e.g., pitching moment as estimated by Cart3D) and the actual
quantity of interest (e.g., pitching moment experienced by a full-scale vehicle at representative
conditions).

As  with  the  Space  Shuttle  and  X-33,  knowledge  of  the  uncertainty  due  to  model  fidelity 
limitations  could  be  used  to  identify  where  further  analysis  is  needed,  and  -  if  the  present 
data  set  had  inherent  fidelity  limitations  -  where  a  better  analysis  method  would  be  needed  as 
well.  Such  knowledge  could  be  critical  for  risk-reduction  operations. 

More  to  the  point,  the  loss  of  such  information  could  lead  to  dangerous  over-confidence  on  the 
part  of  the  designers,  who  might  think  that  the  Kriging  prediction  variance  was  accurately 
capturing  all  relevant  sources  of  uncertainty.  To  avoid  this  situation,  it  is  recommended  that 
uncertainty  due  to  model  limitations  not  be  included  during  Kriging  modeling,  and  instead  that 
such  information  be  applied  as  a  sort  of  “error  bar”  after  the  modeling  is  completed. 

The  two  remaining  sources  of  uncertainty  still  had  to  be  assessed  to  determine  their  effect  on 
prediction  accuracy.  In  addition  to  determining  which  multi-fidelity  method  would  produce 
the  best  prediction  accuracy,  different  combinations  of  uncertainty  sources  would  be 
investigated  to  determine  the  effect  on  model  accuracy.  Multiple  versions  of  each  multi-fidelity 
method  would  be  applied  to  the  same  set  of  data,  differentiated  by  the  uncertainty 
information  being  captured  via  nugget: 

•  No  uncertainty  (deterministic  results); 

•  Only  uncertainty  due  to  surrogate  model  prediction  error; 

•  Only  uncertainty  due  to  solver  iteration  effects;  or, 

•  Both  surrogate  prediction  error  and  solver  iteration  effects. 

The  results  would  determine  which  types  of  uncertainty  should  be  incorporated  and  which 
multi-fidelity  method  would  be  selected  for  this  application. 

7.4  Comparing  Prediction  Accuracy:  Pitching  Moment 

Unlike the tests of contour-based sampling, tests of multi-fidelity methods would not emphasize
any  particular  response  range.  Instead,  prediction  accuracy  would  be  quantified  using  the  Root 
Mean  Squared  Error[85]  of  the  predictions  and  the  95  percent  confidence  quantiles  of  the 
prediction  error.  [105] 
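
A minimal sketch of these two accuracy measures follows. It assumes that the "95 percent confidence quantiles" are the 2.5th and 97.5th percentiles of the signed prediction error; the source does not spell out the exact definition, and the function name is hypothetical.

```python
import numpy as np

def prediction_error_summary(actual, predicted):
    """RMSE and the central 95 percent quantile range of the prediction error."""
    error = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    rmse = float(np.sqrt(np.mean(error**2)))
    # Assumed definition: 2.5th and 97.5th percentiles of the signed error.
    q_low, q_high = np.quantile(error, [0.025, 0.975])
    return rmse, float(q_low), float(q_high)

# Tiny made-up example: errors of +1 and -1 in equal numbers.
rmse, q_low, q_high = prediction_error_summary([0, 0, 0, 0], [1, -1, 1, -1])
print(rmse, q_low, q_high)
```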

Prior  work[40]  showed  that  the  lateral  responses  exhibited  much  more  iteration  noise  at  40° 
angle  of  attack  than  at  0°  angle  of  attack.  Both  angles  were  of  interest  for  the  present 
application,  so  data  from  both  angles  of  attack  were  used  to  evaluate  the  multi-fidelity  methods 
and  uncertainty  sources.  It  was  expected  that,  due  to  the  increased  iteration  noise  at  high 
angles  of  attack,  models  which  captured  this  iteration  noise  would  have  better  prediction 
accuracy  at  that  flight  condition  than  models  which  did  not. 
Data was generated at Mach 4.0, α 0°, β 0° and Mach 4.0, α 40°, β 0°. Cases from the
nine-dimensional space-filling nested Latin hypercube, described in Section 6.10, were analyzed at
these  new  flight  conditions  to  see  how  the  methods  fared  when  applied  to  a  problem  with  a 
moderately  large  design  space. 

Of  the  space-filling  nested  Latin  hypercube  cases,  6,000  were  evaluated.  This  completed  the 
4,000-case  level  of  the  NLHC,  plus  half  of  the  cases  needed  to  fill  the  8,000-case  level  of  the 
NLHC.  Although  the  partial  data  set  could  not  be  guaranteed  to  be  space-filling,  this  was  not 
expected  to  affect  the  present  test.  The  objective  was  to  evaluate  the  relative  accuracy  of  each 
method,  and  thus  the  supporting  data  pool  need  not  be  perfectly  space-filling  to  be  of  service. 
Five alternatives were tested:

•  Mono-fidelity data (only Cart3D cases);

•  Additive  correction; 

•  Proportional  correction; 

•  Ghoreyshi  cokriging;  and, 

•  Data  harmonization. 

Of  these  methods,  all  but  the  first  required  low-fidelity  data.  APAS  acted  as  the  low-fidelity 
data  source  for  this  research  effort.  In  the  interest  of  expediency,  surrogate  models  were  created 
to  estimate  aerodynamic  response  values  more  quickly  than  if  APAS  were  run  directly. 

7.4.1  Preservation  of  Uncertainty  in  Kriging  Models 

The  16,000-case  nested  Latin  hypercube  described  in  Section  6.10  was  analyzed  with 
APAS at Mach 4, α 0° and Mach 4, α 40°. Neural networks were chosen as the surrogate
modeling technique for this demonstration due to the ability of neural networks to incorporate
many thousands of cases during the training process,[74] whereas a Kriging model attempting
to  include  the  same  number  of  cases  was  likely  to  be  intractable  due  to  memory  limitations  and 
excessive  computational  requirements. 

BRAINN, which stands for “Basic Regression Analysis for Integrated Neural Networks,” is the
software utility that was used to train the neural networks. It is an internal tool at the Aerospace
Systems Design Laboratory developed by Carl Johnson and Jeff Schutte.[20] BRAINN
interfaces  with  utilities  from  Matlab’s  Neural  Network  Toolbox  and  automates  the  process  of 
fitting  neural  networks  of  various  sizes  to  the  data.  The  user  specifies  a  minimum  and 
maximum  network  size  to  consider,  as  well  as  the  error  definition,  and  the  software  will  cycle 
through  those  network  sizes  and  retain  the  network  that  best  fits  the  data  without  over-fitting. 
Over-fitting  occurs  when  the  model  being  trained  becomes  progressively  better  at  reproducing 
the  training  data  set  while  in  turn  becoming  progressively  worse  at  estimating  the  response  at 
other  data  points.  [74]  The  surrogate  models  of  the  pitching  moments  were  both  of  good  quality 
as  determined  by  goodness-of-fit  checks. [36] 

BRAINN  includes  the  option  to  export  the  best-performing  neural  network  in  a  format  that 
can  be  easily  interpreted  by  Matlab.  Those  exported  files  were  adapted  into  functions  so  that 
any  other  Matlab  script  or  function  could  submit  the  values  of  the  independent  variables  to  the 
function  and  receive  estimated  response  values  in  return.  The  functions  were  vectorized  so 
that  multiple  predictions  could  be  performed  simultaneously.  [122]  Once  the  surrogate  models 
were  accessible  by  Matlab,  the  tests  could  begin. 

7.4.2 Execution of Analysis

For this analysis, certain factors were held constant. The Kriging models used anisotropic
Gaussian correlation functions and linear underlying trends. The first 6,000 of the 16,000-case
nested Latin hypercube cases had been analyzed previously and sorted according to flight
condition. The appropriate data set was loaded and divided into two sets. The cases which made
up the 4,000-case hypercube became the first set, as this set of cases was known to be
space-filling. The remaining cases went into the second set.

Some number of cases would be randomly selected from the first set using the function
randsample from the Matlab Statistics Toolbox and, based on those cases, one surrogate model
would be trained using each multi-fidelity technique. The surrogates would then be used to
predict response values at all the cases in the second set; the error of these predictions would be
quantified via RMSE.

The number of cases selected varied from 100 to 1,000 cases in increments of 100 to assess
how rapidly the accuracy of each method improved when more training data was available.
Most multi-fidelity methods needed to have both the high- and low-fidelity response for each
case in the training set; the high-fidelity response came from Cart3D analysis, while the
low-fidelity response was estimated with the neural networks trained to replicate APAS results.

Because data harmonization fits a model to both high- and low-fidelity data simultaneously, the
cases selected as described above were used as the high-fidelity samples, while a separate set
of 1,000 cases was randomly selected in the same manner for use as low-fidelity
samples. The constant size of the low-fidelity data pool was chosen based on a desire to
balance method effectiveness against computational cost - the two-dimensional results (see
Figure 26) indicated that the method became more accurate as more low-fidelity data was
included, but attempting to include 15 low-fidelity points for each of the 1,000 high-fidelity
cases at the high end of the range would be computationally devastating. A constant set of
1,000 low-fidelity cases would equate to a data source ratio between 10 (for 100 high-fidelity
cases) and 1 (for 1,000 high-fidelity cases). The two sets of data were then combined and
modeled jointly, as described in Section 7.1.5. If data harmonization performance degraded
relative to the other methods as more high-fidelity samples were included, the experiment could
be repeated with a constant proportion of low-fidelity samples.

All cases in this study were randomly selected, and case selection would affect the performance
of the model. To account for this random effect, the sampling-and-prediction process was
repeated at least 500 times for each training data set size - i.e., 500 repetitions with 100 Cart3D
samples, 500 repetitions with 200 Cart3D samples, etc. - and the resulting RMSE scores
averaged.
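
The sampling-and-prediction loop described above can be sketched as follows. This is only an outline under stated assumptions: the surrogate is a plain least-squares fit standing in for the Kriging-based methods, the nine-dimensional data are synthetic, `toy_response` is a made-up function, and only 20 repetitions are run to keep the example fast. Sampling without replacement via `rng.choice` plays the role that randsample plays in the Matlab workflow.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def fit_linear(x, y):
    """Stand-in surrogate (NOT Kriging): least squares with an intercept."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: np.column_stack([np.ones(len(x_new)), x_new]) @ coef

def average_rmse(x_pool, y_pool, x_test, y_test, n_train, n_reps, rng, fit=fit_linear):
    """Average test RMSE over repeated random draws of n_train training cases."""
    scores = []
    for _ in range(n_reps):
        picked = rng.choice(len(x_pool), size=n_train, replace=False)
        predict = fit(x_pool[picked], y_pool[picked])
        scores.append(rmse(y_test, predict(x_test)))
    return float(np.mean(scores))

def toy_response(x):
    """Made-up nine-dimensional response used only for this illustration."""
    return x @ np.linspace(0.1, 0.9, 9) + 0.1 * np.sin(6.0 * x[:, 0])

rng = np.random.default_rng(0)
x_pool = rng.uniform(size=(4000, 9))   # plays the role of the first (training) set
x_test = rng.uniform(size=(1700, 9))   # plays the role of the second (test) set
y_pool, y_test = toy_response(x_pool), toy_response(x_test)

results = {n: average_rmse(x_pool, y_pool, x_test, y_test, n, n_reps=20, rng=rng)
           for n in range(100, 1001, 100)}
for n, score in results.items():
    print(n, round(score, 4))
```

Passing a different `fit` callable would let each multi-fidelity technique be scored against the same random draws.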

7.4.3  Relative  Speed  of  Each  Method 

The reader may recall from Section 5.6 that the time required to fit a Kriging model increased
as the number of points in the data set increased. In general, for a set of m points, the effort
to fit the Kriging model grows as $O(m^3)$.[136] All surrogate models evaluated in these
experiments used Kriging, and thus the training time would increase as more points were
considered.
As the number of high-fidelity samples ranged from 100 to 1,000, the amount of time required
to build and evaluate each model grew. The time required to create a surrogate and evaluate
its predictive accuracy was averaged over the 500 repetitions performed; the results are
reported in Table 4. Most methods demonstrated similar time requirements. The exception
here was data harmonization, which required almost two orders of magnitude more time to train
models and evaluate predictive accuracy at the smallest training set sizes, and roughly six
times more at 1,000 points.

As  stated  earlier,  data  harmonization  surrogate  models  for  these  tests  were  trained  using  the 
specified  number  of  high-fidelity  samples  and  1,000  low-fidelity  data  points.  Even  when  only 
100  high-fidelity  samples  were  used,  the  full  data  pool  contained  1,100  samples  -  much  more 
than  the  100  samples  being  modeled  by  the  other  methods.  Furthermore,  data  harmonization 
sometimes  threw  out-of-memory  errors  when  attempting  to  predict  response  values  for  all  test 
points  simultaneously.  As  a  work-around,  the  response  was  predicted  for  each  test  point 
individually,  which  reduced  the  memory  burden  but  increased  execution  time. 
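
The memory work-around can be generalized to chunked prediction, of which point-by-point prediction is the special case with a batch size of one. The sketch below is illustrative only: `predict` is a hypothetical stand-in that forms a dense cross-correlation block, as a Kriging predictor typically does, and is not the data harmonization code.

```python
import numpy as np

def predict_in_batches(predict, x_test, batch_size):
    """Apply a vectorized predictor to x_test in chunks to cap peak memory."""
    outputs = []
    for start in range(0, len(x_test), batch_size):
        outputs.append(predict(x_test[start:start + batch_size]))
    return np.concatenate(outputs)

# Hypothetical stand-in predictor with 50 training points and fixed weights.
x_train = np.linspace(0.0, 1.0, 50)
weights = np.cos(3.0 * x_train)

def predict(x_new):
    # The (len(x_new) x len(x_train)) block is what drives memory use.
    r = np.exp(-10.0 * (x_new[:, None] - x_train[None, :]) ** 2)
    return r @ weights

x_test = np.linspace(0.0, 1.0, 1700)
full = predict(x_test)                                   # all points at once
one_by_one = predict_in_batches(predict, x_test, 1)      # the work-around in the text
chunked = predict_in_batches(predict, x_test, 200)       # a middle ground
```

An intermediate batch size trades memory for speed between the two extremes.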

The  bulk  of  the  processing  in  support  of  these  experiments  was  done  on  High  Performance 
Computing  clusters.  Jobs  were  submitted  for  processing  using  a  queue  system  wherein  the 
user  specified  the  number  of  computing  nodes  needed  for  the  job,  as  well  as  the  expected  job 
duration.  The  queue  system  limited  jobs  to  a  maximum  duration  of  7  days.  Running  in  parallel 
on  12  cores,  the  full  data  harmonization  test  with  500  repetitions  at  each  size  would  have 
required over 11 days, compared to around 1.4 days for the full Ghoreyshi cokriging test. The
computational  investment  required  for  data  harmonization  was  clearly  much  larger  than  for  the 
other  methods. 


Table 4: Average Time To Build & Evaluate a Surrogate (in seconds)

Number      Single      Additive      Proportional   Ghoreyshi   Data
of Points   Fidelity    Correction    Correction     Cokriging   Harmonization
---------   --------    ----------    ------------   ---------   -------------
100         1           2             1              1           95
200         2           3             2              3           96
300         4           5             4              5           116
400         7           8             7              9           141
500         12          12            11             14          175
600         18          20            18             21          204
700         26          28            27             31          233
800         35          39            36             42          279
900         44          49            44             54          300
1,000       56          63            57             69          336
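
As a rough cross-check of the Table 4 timings against the cubic growth quoted above, the sketch below fits a straight line to the timings on log-log axes. The times are copied from Table 4. For data harmonization the point count is taken as the pooled size (high-fidelity samples plus the 1,000 low-fidelity points), which is an assumption about what drives its cost. The reported times are rounded to whole seconds and include prediction as well as fitting, so the fitted exponents are indicative only.

```python
import numpy as np

n_points = np.arange(100, 1001, 100, dtype=float)
single_fidelity_s = np.array([1, 2, 4, 7, 12, 18, 26, 35, 44, 56], dtype=float)
data_harmonization_s = np.array([95, 96, 116, 141, 175, 204, 233, 279, 300, 336],
                                dtype=float)

def loglog_slope(n, seconds):
    """Least-squares slope of log(time) against log(number of points)."""
    slope, _intercept = np.polyfit(np.log(n), np.log(seconds), 1)
    return float(slope)

slope_single = loglog_slope(n_points, single_fidelity_s)
# Assumed cost driver for data harmonization: pooled size of both data sources.
slope_pooled = loglog_slope(n_points + 1000.0, data_harmonization_s)

print("single-fidelity exponent:   ", round(slope_single, 2))
print("data harmonization exponent:", round(slope_pooled, 2))
```

Both fitted exponents fall below three, which may reflect fixed overheads and lower-order terms that still matter at these problem sizes.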

7.4.4  Results  of  Mach  4,  AoA  0°  Test 

The  average  RMSE  of  the  prediction  error  was  tracked  for  each  modeling  method;  the  results 
may be seen in Figure 34.
Curiously, the single-fidelity approach was competitive in this case, being on average more
accurate than additive or proportional correction.

Unlike  the  other  methods,  proportional  correction  behaved  somewhat  erratically;  note  that  there 
were  multiple  instances  where  the  accuracy  of  proportional  correction  did  not  increase  as 
more  training  data  was  made  available.  The  smooth  increase  in  accuracy  displayed  by  the  other 
methods  suggested  that  the  accuracy  of  proportional  correction  was  more  capricious  than  the 
others,  with  substantially  more  variability  in  the  quality  of  the  surrogates.  This  was  not  a 
desirable  quality. 

Data  harmonization  and  additive  correction  exhibited  similar  performance,  which  was 
unsurprising given the similarity between the two methods that was identified in Section 7.1.5.
Ghoreyshi  cokriging  exhibited  very  good  predictive  accuracy,  even  when  relatively  few  high- 
fidelity  samples  were  available.  As  seen  in  Table  4,  this  method  did  take  slightly  longer  than 
most  of  the  other  methods  (commonly  10-20  percent  longer).  The  improvement  in  accuracy 
for  this  flight  condition  was  considered  sufficient  to  justify  the  increased  processing  time. 

Figure  34:  Prediction  RMSE  for  All  Multi-Fidelity  Methods  at  Mach  4,  AoA  0° 

Note  that  the  effects  of  uncertainty  information  were  not  called  out  in  these  results.  It  was 
observed  that,  for  this  response,  the  choice  of  uncertainty  data  had  a  much  smaller  effect 
than  the  choice  of  multi-fidelity  technique.  Given  the  scale  of  the  ordinate  axis  in  Figure  34, 
if  the  results  for  each  of  the  Ghoreyshi  cokriging  models  (no  uncertainty,  iteration  noise  effects, 
surrogate  approximation  noise  effects,  and  both  noise  effects)  were  plotted,  all  of  the  resulting 
icons  would  appear  to  be  coincident. 

7.4.5  Results  of  Mach  4,  AoA  40°  Test 

The four multi-fidelity methods were then applied to the Mach 4, α 40° data set. Once again,
each  method  was  applied  using  four  different  noise  types  which  were  captured  via  nugget: 
zero  noise,  noise  from  the  iterative  solution  process,  noise  from  low-fidelity  prediction  error, 
and  noise  from  both  the  iterative  solution  process  and  the  low-fidelity  prediction  error.  The 
results  are  plotted  in  Figure  35. 
Unlike the previous test, in this case the proportional corrector demonstrated very good
performance. Proportional correction and Ghoreyshi cokriging had effectively the best predictive
accuracy as judged by RMSE. Once again, the single-fidelity method out-performed the
additive  corrector.  The  data  harmonization  predictive  accuracy  gradually  improved, 
approaching  that  of  the  additive  corrector,  as  more  high-fidelity  cases  were  included  in  the 
data  set.  Given  the  lackluster  performance  of  data  harmonization  and  the  very  large 
computational  expense  associated  with  the  method  as  noted  in  Table  4,  tests  of  data 
harmonization  were  curtailed  before  the  full  analysis  was  complete. 

Figure 35: Prediction RMSE for All Multi-Fidelity Methods at Mach 4, AoA 40°

As with the α 0° results, the choice of uncertainty information did not significantly affect the
predictive  accuracy  of  the  surrogates.  The  choice  of  multi-fidelity  technique  had  a  much 
larger  effect  on  predictive  accuracy. 

7.5  Observations  &  Further  Inquiry 

Given  the  strong  performance  of  Ghoreyshi  cokriging  for  both  sets  of  test  data,  it  was  selected 
as  the  most  effective  multi-fidelity  technique  for  this  application.  It  served  as  the  default 
modeling  approach  for  subsequent  experiments. 

A  number  of  curious  observations  were  made  in  the  course  of  these  tests,  such  as  the  highly 
variable  performance  of  proportional  correction  and  the  negligible  effect  of  incorporating 
uncertainty.  The  next  section  will  address  those  observations  in  greater  depth. 

7.5.1  Discrepancy  in  Proportional  Correction  Performance 

Qualitatively, most of the results for Mach 4, α 40° resemble those observed for Mach 4, α 0°,
with the exception of proportional correction. Proportional correction had the worst
performance  at  the  low  angle  of  attack  but  extremely  good  performance  at  the  high  angle  of 
attack. 


A few statistical details about the data points used to test predictive accuracy - 1,719 points for
α 0° and 1,693 for α 40° - are given in Table 5. Critically, note the very large variation in
ratios between the low- and high-fidelity Cm values for α 0°, and the very small range of
variation for the same ratio at α 40°.


Table 5: Distribution of Cm For Both Flight Conditions

                      Mach 4, α 0°                       Mach 4, α 40°
                      Cart3D    APAS     Ratio of        Cart3D    APAS     Ratio of
                      Cm        Cm       Cart3D to       Cm        Cm       Cart3D to
                                         APAS                               APAS
------------------    ------    -----    ---------       ------    -----    ---------
Minimum               -12.2     -33.1    -113            0.02      0.88     0.02
Maximum               2.99      6.14     61.7            27.4      72.3     0.60
Average               -0.63     -1.93    0.22            5.10      12.8     0.41
Standard Deviation    1.13      3.05     4.17            3.61      9.46     0.06
Within ±1             79%       52%      -               0.1%      0.0%     -

Proportional  correction  fits  a  surrogate  model  to  the  ratio  of  the  high-fidelity  response  to  the 
low-fidelity  response  (as  seen  in  Equation  24).  When  the  response  values  are  close  to  zero, 
especially  the  low-fidelity  response,  small  absolute  changes  in  the  values  may  produce  large 
variations in the ratio. This was observed in the α 0° results. These large variations can be
difficult  for  a  surrogate  model  to  fit  effectively.  As  a  result,  proportional  correction  models  may 
have  poor  accuracy  when  the  ratio  varies  rapidly,  such  as  when  the  low-fidelity  response  is  often 
close  to  zero. 

Conversely, at α 40°, neither the low- nor the high-fidelity data encompassed zero, and very few of
those cases fell within ±1. The resulting ratio values were much less volatile and therefore
easier to predict. This led to the excellent predictive accuracy demonstrated by the method in
Figure 35.
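
The sensitivity of the ratio to near-zero low-fidelity values can be illustrated with synthetic numbers. In the sketch below the relationship between the two fidelities (`high_from_low`, a factor of 0.4 plus a small offset of 0.05) is a made-up example, and the two sampling ranges only loosely mimic the α 0° and α 40° situations; none of the values are taken from the Cart3D or APAS data.

```python
import numpy as np

def high_from_low(low):
    """Made-up high-fidelity response: scaled low-fidelity value plus a small offset."""
    return 0.4 * low + 0.05

def ratio_spread(low):
    """Standard deviation of the high-to-low ratio that a surrogate would have to fit."""
    return float(np.std(high_from_low(low) / low))

# Low-fidelity values straddling zero (even count, so no sample is exactly zero).
low_near_zero = np.linspace(-1.0, 1.0, 200)
# Low-fidelity values well away from zero.
low_far_from_zero = np.linspace(5.0, 30.0, 200)

spread_near = ratio_spread(low_near_zero)
spread_far = ratio_spread(low_far_from_zero)
print("ratio spread near zero:     ", spread_near)
print("ratio spread away from zero:", spread_far)
```

The same small offset that barely moves the ratio when the low-fidelity response is large dominates it when that response approaches zero.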

7.5.2  Modeling  Uncertainty  in  Pitching  Moment  Coefficient 

At both flight conditions, the four options for incorporating uncertainty resulted in very small
effects with respect to pitching moment coefficient prediction RMSE, on the order of 1-2 percent.

This  is  very  small  relative  to  the  differences  in  predictive  accuracy  between  multi-fidelity 
methods,  which  were  on  the  order  of  80  percent  or  larger.  Given  the  scale  of  the  vertical  axis  in 
Figure  34,  if  every  variant  of  a  given  multi-fidelity  method  were  plotted,  the  set  would  appear  to 
be  coincident. 

Modeling  uncertainty  did  not  improve  the  prediction  of  pitching  moment  coefficient  at  either 
flight  condition.  However,  pitching  moment  coefficient  had  relatively  low  levels  of  iterative 
noise.  In  fact,  the  pitching  moment  was  the  most  highly-weighted  factor  for  the  adaptive  grid 
refinements  of  Cart3D  (see  Appendix  D.4.2  for  more  details).  As  a  result,  the  pitching 
moment  was  often  well-converged. 
The use of nuggets was primarily intended to improve the accuracy of surrogate models for
lateral  responses,  which  were  more  susceptible  to  noise.  Although  this  improvement  was  not 
observed  for  pitching  moment  (a  longitudinal  response),  it  could  still  occur  when  modeling 
lateral  responses.  The  nine-dimensional  experiments  were  therefore  repeated,  using  yawing 
moment  coefficient  as  the  response  of  interest  rather  than  pitching  moment  coefficient. 

7.6  Comparing  Prediction  Accuracy:  Yawing  Moment 

To  review  from  Section  6.10,  the  data  set  featured  9  independent  variables  -  nose  droop,  nose 
fineness  ratio,  two  nose  spline  shape  parameters,  wing  half-span  fraction,  wing  airfoil  camber, 
vertical-tail-to-wing  area  ratio,  wing  root  chord  fraction  and  fuselage  radius  fraction  -  all  of 
which  affected  the  vehicle  outer  mold  line  (OML)  in  a  symmetric  fashion.  Additionally,  the 
available  data  points  were  for  flight  conditions  with  zero  sideslip.  The  flow  solver  does  not 
capture  viscous  effects,  so  effects  such  as  vortex  shedding,[16]  which  might  introduce 
oscillating  lateral  forces  and  moments,  were  not  present  in  the  simulations. 

In  short,  the  mechanisms  that  would  introduce  real  lateral  forces  or  moments  -  asymmetric 
OML,  asymmetric  flight  condition,  viscous  effects,  etc.  -  were  not  present,  and  thus  any 
such  forces  or  moments  which  appear  in  the  data  set  could  be  assumed  to  be  spurious,  resulting 
from  the  simulation  itself  rather  than  the  phenomena  being  simulated.  A  similar  conclusion 
could  be  made  for  the  low-fidelity  data  from  APAS. 

The  Unified  Distributed  Panel  analysis  (UDP),  which  acts  as  the  subsonic  and  transonic  analysis 
tool  for  APAS,  consistently  estimated  asymmetric  effects  for  symmetric  cases  to  be  zero. 
The  Supersonic  Hypersonic  Arbitrary  Body  Program  (S/HABP)  is  used  for  supersonic  and 
hypersonic  analyses  in  APAS  and  may  predict  small  but  nonzero  lateral  responses  for 
symmetric cases. The flight conditions included in this study were Mach 4, α 0° and Mach 4,
α 40°, and both the high-fidelity and low-fidelity data demonstrated some degree of noise.

When a configuration is analyzed using Cart3D, the program discretizes the volume around the
configuration  into  cells  and  repeatedly  solves  the  three-dimensional  Euler  equations  for  a 
perfect  inviscid  gas  within  each  cell  until  the  solver  converges  to  a  steady-state  solution.  [6] 
Because  the  equations  are  solved  on  a  grid  of  cells  rather  than  a  continuum,  some  small  errors 
may  be  introduced.  As  the  mesh  is  refined,  this  discretization  or  “truncation  error”  is 
reduced;[165]  this  also  corresponds  to  increasing  computational  effort,  as  more  calculations 
must  be  performed  for  each  iteration.  The  discretized  nature  of  the  analysis  can  lead  to 
oscillatory  behavior,  as  shown  in  Figure  7. 

Because both data sources included some degree of noise - iteration noise for Cart3D data,
surrogate prediction error for estimated APAS data - it was expected that this study would test
whether the inclusion of uncertainty information via the Kriging nugget would improve model
prediction accuracy for noisy or uncertain data. For each of the four modeling techniques
(additive correction, proportional correction, Ghoreyshi cokriging, and single-fidelity
modeling), four variants were tested. Data harmonization was not included in this experiment
in light of the excessive computational effort required and its poor performance in the previous
test. Each variant was defined by the uncertainty data which was incorporated using nuggets.
The options for uncertainty were:

•  No uncertainty, i.e., all data points are deterministic;

•  Uncertainty from solver iteration, which is unique to each high-fidelity data point;

•  Uncertainty from surrogate model prediction error, which is a constant value for every
low-fidelity data point; and,

•  Uncertainty from both solver iteration and surrogate model prediction error, which would
be a combination of unique and constant values.
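
One way the four variants could be turned into nugget values is sketched below. This is an assumption-laden illustration, not the implementation used in the study: it treats the two error sources as independent so that their variances add, and the function name, variant labels, and numerical values are all hypothetical.

```python
import numpy as np

VARIANTS = ("none", "iteration", "surrogate", "both")

def build_nuggets(iteration_variance, surrogate_variance, variant):
    """Per-case nugget values for one uncertainty variant.

    iteration_variance: one value per case (unique to each high-fidelity point).
    surrogate_variance: a single constant applied to every case.
    ASSUMPTION: independent error sources, so variances are summed for "both".
    """
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    iteration_variance = np.asarray(iteration_variance, dtype=float)
    nuggets = np.zeros_like(iteration_variance)
    if variant in ("iteration", "both"):
        nuggets = nuggets + iteration_variance
    if variant in ("surrogate", "both"):
        nuggets = nuggets + surrogate_variance
    return nuggets

# Made-up variances for three training cases.
iteration_var = np.array([1e-6, 4e-6, 9e-6])
surrogate_var = 2e-6

for variant in VARIANTS:
    print(variant, build_nuggets(iteration_var, surrogate_var, variant))
```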

As in the previous study, random points from the first 4,000 space-filling data points were
selected and used to build a surrogate model. That surrogate model was then used to estimate
the response values for a separate set of roughly 1,700 data points, and the discrepancies
between predicted and recorded values were used to calculate the prediction RMSE for that
method and form of uncertainty.

Figure 36: Ghoreyshi Cokriging Prediction RMSE for Yaw at Mach 4, AoA 0°

7.6.1  Results  of  Mach  4,  AoA  0°  Study 

For each multi-fidelity method, the same pattern was observed. Variant 1, the noiseless variant,
had the largest (i.e., worst) prediction RMSE. Variants 2 and 4, the former using iteration
uncertainty only and the latter using both iteration and surrogate prediction uncertainty,
had the smallest prediction RMSE. Curiously, the performance of variant 3, which included
only surrogate prediction uncertainty, differed between multi-fidelity methods. For additive
correction and single-fidelity modeling, variant 3 performed as well as variants 2 and 4.
For Ghoreyshi cokriging and proportional correction, variant 3 had equivalent performance to
variant 1, the noiseless case. Figure 36 shows the results for Ghoreyshi cokriging. As noted
above, these results are for the most part indicative of the results for the other multi-fidelity
methods with the exception of variant 3.

The actual-by-predicted plots for two types of Ghoreyshi cokriging models, a noiseless model
and a model which included iteration noise, may be seen in Figure 36a and 36b, respectively. In
an actual-by-predicted plot, a perfect fit would appear as a straight line from the lower-left
corner to the upper-right corner. Since all data came from steady-state symmetric models, the
true yawing moment was expected to be zero for all cases. The bounds on the graphs were
selected as a balance between capturing the range of the deterministic predictions and
showing details of the noisy model’s predictions.

As  expected,  the  deterministic  model  attempted  to  reproduce  the  training  data  exactly.  The 
strong  horizontal  spread  in  the  data  indicated  that  this  model  would  over-predict  yawing 
moments  compared  to  the  calculated  values.  In  fact,  the  left  and  right  bounds  on  this  graph  cut 
off  19  percent  of  the  data  points,  while  the  upper  and  lower  bounds  cut  off  only  2.6  percent  of 
the  points. 

Conversely, the model which incorporated iteration noise (the “noisy” model shown in Figure 36b)
had a strong vertical trend. This indicated that the predicted yawing moments were smaller
than those calculated by Cart3D. Given that the flight condition and vehicle were symmetric,
it was plausible that the predicted results might actually be a better representation of the
true behavior than the raw Cart3D results.

Figure 36: Actual-By-Predicted Plots for Yaw at Mach 4, AoA 0° For (a) Noiseless Model, and (b) Model Incorporating Iteration Noise

Lastly, it should be noted that, while the actual-by-predicted plot gives a decent sketch of the
relationship between the actual and predicted response values, it lacks precision; points lie
atop one another, making it difficult to convey the true distribution of results. It is worth
stating that 94 percent of the Cart3D results exhibited yawing moments within ±0.005. Of
the predictions by the noisy model, 99.6 percent fell within that range; the deterministic model
predicted that only 33 percent of the cases had yawing moments within that range. This
highlighted the degree to which the deterministic model over-fit the response behavior.
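
The percentages quoted above are fractions of cases whose yawing moment lies inside a symmetric band around zero. A minimal sketch of that calculation follows; the function name and the sample values are hypothetical, and only the ±0.005 band is taken from the text.

```python
def fraction_within(values, tolerance):
    """Fraction of values whose magnitude does not exceed the tolerance."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(1 for v in values if abs(v) <= tolerance) / len(values)

# Hypothetical yawing-moment samples; three of the five lie inside the band.
sample = [0.001, -0.004, 0.012, 0.0, -0.020]
print(fraction_within(sample, 0.005))
```

Applying the same function to the recorded values and to each model's predictions gives a distribution summary that is not affected by overlapping points in a scatter plot.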


7.6.2  Results  of  Mach  4,  AoA  40°  Study 

The results at the higher angle of attack were qualitatively similar to those at 0°: variants
which included uncertainty due to iteration consistently had smaller RMSE values (i.e., more
accurate predictions) than the noiseless variants, while the effect of each uncertainty variant
depended on the multi-fidelity method being applied. The results for Ghoreyshi cokriging are
shown in Figure 37. Unlike the previous flight condition, the relative difference between
variants was much smaller in this study. In fact, the worst prediction RMSE at this flight
condition by any method (0.081) was better than the best prediction RMSE at α 0° by any
method (0.155). Although capturing uncertainty due to iteration noise improved surrogate
accuracy, this improvement was smaller at this flight condition.

Figure 37: Ghoreyshi Cokriging Prediction RMSE for Yaw at Mach 4, AoA 40°

The results for this flight condition showed less variation between options. Qualitatively, both
actual-by-predicted plots in Figure 38a & 38b have moderate scatter. Each exhibits a pattern
which is more strongly horizontal than vertical, indicating that both models were over-fitting
the data to some degree. This was reflected in the distribution of predictions; 93 percent of the
Cart3D results had yawing moments within ±0.005, whereas 82 percent of the “noisy”
model’s predictions and 70 percent of the deterministic model’s predictions fell outside that
range. At first glance, the distributions of the Cart3D results were roughly equivalent, but the
noiseless model had much better agreement. The data sets were investigated to determine the
cause of the differences in performance.

Figure 38: Actual-By-Predicted Plots for Yaw at Mach 4, AoA 40° For (a) a Deterministic Model, and (b) a
Model Incorporating Iteration Noise



Approved  for  public  release;  distribution  unlimited 


JMP  statistical  software[91]  was  used  to  analyze  the  yaw  data  at  each  flight  condition.  The  first 
analysis  was  to  evaluate  the  distributions  of  the  yaw  and  iteration  noise  values.  Unexpectedly, 
the  iteration  noise  at  40°  was  typically  an  order  of  magnitude  less  than  at  0°,  not  greater.  This 
was  in  stark  contrast  to  what  was  observed  in  the  motivating  example  (Section  4.5.3)  which 
evaluated  iterative  noise  at  Mach  0.9.  The  reduced  iteration  noise  would  indicate  that  the 
high-α results at Mach 4 are less subject to random effects; this may be the reason for the
smaller  RMSE  values  observed  for  even  the  deterministic  models  when  applied  to  the  40°  data. 

The second step was to analyze the correlation between the variables. Given a set of data, the
correlation between two parameters can be estimated with the sample correlation
coefficient,[76] also known as the Pearson product moment correlation coefficient:

$$ r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \, \sqrt{\sum_{i=1}^{n} y_i^2 - n \bar{y}^2}} \tag{35} $$

Here, $r$ is the correlation coefficient, $n$ is the total number of samples in the data set, $x_i$ and $y_i$
are the individual observations of variables $x$ and $y$ respectively, and $\bar{x}$ and $\bar{y}$ are the average
values of $x$ and $y$ over the $n$ samples.

If two variables behave similarly, e.g., when one increases the other tends to increase, they are
said to be correlated. The sample correlation coefficient quantifies this relationship. A
correlation coefficient value of 1 indicates perfect positive correlation, i.e., when x increases,
y will always increase. A value of -1 indicates perfect negative correlation; when x increases, y will
always decrease. A value of 0 indicates no linear correlation, i.e., knowledge of x tells you nothing
about the linear trend of y.
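
The sample correlation coefficient can be sketched directly from the sums of products and the sample means used in Equation 35. The function name and test vectors below are illustrative assumptions; the study itself used JMP for this analysis.

```python
import math

def sample_correlation(x, y):
    """Pearson product moment correlation from sums of products and sample means."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    syy = sum(yi * yi for yi in y) - n * y_bar ** 2
    if sxx <= 0.0 or syy <= 0.0:
        raise ValueError("correlation is undefined for a constant variable")
    return numerator / math.sqrt(sxx * syy)

# Perfectly correlated, perfectly anti-correlated, and uncorrelated examples.
print(sample_correlation([1, 2, 3], [2, 4, 6]))
print(sample_correlation([1, 2, 3], [3, 2, 1]))
print(sample_correlation([-1, 0, 1], [1, 0, 1]))
```

The last example shows why a coefficient of zero only rules out a linear trend: y depends on x exactly (through its magnitude), yet the coefficient vanishes.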

Table 6 displays the correlation between the absolute value of the yaw response calculated by
Cart3D,  the  standard  deviation  of  the  Cart3D  yaw  solution  over  the  last  20  iterations  of  the 
analysis,  and  the  absolute  value  of  the  APAS  yaw  solution.  Note  the  negligible  correlation 
between  the  Cart3D  and  APAS  yaw  values.  This  was  unsurprising  given  that  both  values  were 
expected  to  be  spurious. 
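
The iteration-noise measure used here is the standard deviation of a solver output over its final iterations. A minimal sketch is given below; the function name, the synthetic convergence history, and the choice of the sample (rather than population) standard deviation are assumptions, since the text specifies only the 20-iteration window.

```python
import statistics

def iteration_noise(history, window=20):
    """Standard deviation of a solver output over its final `window` iterations."""
    if window < 2 or len(history) < window:
        raise ValueError("need at least `window` (>= 2) iterations")
    return statistics.stdev(history[-window:])

# Hypothetical history: a decaying transient followed by a small oscillation.
history = [0.05 / (k + 1) for k in range(80)] + [0.001 * (-1) ** k for k in range(20)]
print(iteration_noise(history))
```

Only the tail of the history contributes, so the early transient does not inflate the estimate.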


Table 6: Correlation of Yaw Data

                          Mach 2.5, α 0°                          Mach 2.5, α 40°
                  |Yaw| (Cart3D)   σ_Yaw   |Yaw| (APAS)   |Yaw| (Cart3D)   σ_Yaw   |Yaw| (APAS)
|Yaw| (Cart3D)         1.00         0.45       0.04            1.00         0.16       0.00
σ_Yaw                  0.45         1.00       0.05            0.16         1.00      -0.02
|Yaw| (APAS)           0.04         0.05       1.00            0.00        -0.02       1.00

In the AoA 0° data, there was a moderate positive correlation (0.45) between the yaw response and the
observed iteration noise. Because the standard deviation cannot take a value less than zero,
the absolute value of yaw was used; the correlation then indicates whether more noise was observed for cases
that were reported to be farther from zero, which seemed to be the case. Iteration noise being
larger for cases farther from zero suggested two things: first, that those cases were more likely
to be spurious and thus negligible; and second, that the Kriging models which capture that iteration
noise via nuggets would be more likely to disregard those cases and tailor the model to the cases
with less noise (and likely smaller yaw values).

The AoA 40° results showed much less correlation between the iteration noise and the Cart3D
yaw  result.  That  observation  would  indicate  that  even  when  iteration  noise  was  taken  into 
account,  the  resulting  models  would  not  be  as  effective  at  screening  out  that  noise  to  identify  the 
underlying  response  behavior.  This  matched  the  results  seen  in  Figure  39  where  both 
modeling  approaches  were  approximately  equally  accurate.  A  small  correlation  was  still  present, 
however,  and  capturing  iteration  noise  did  result  in  a  mild  improvement  in  prediction  accuracy. 
This  indicated  that,  although  capturing  uncertainty  with  nuggets  did  improve  prediction  accuracy, 
this  improvement  was  to  some  degree  proportional  to  the  extent  that  the  data  exhibited 
significant  noise. 
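
The screening effect of a nugget can be illustrated with a deliberately small sketch: a zero-mean Kriging predictor in one dimension with a Gaussian correlation function, where each training point carries its own nugget on the diagonal of the covariance matrix. Everything here is an assumption made for illustration (the fixed `theta` and `variance`, the zero mean, the five training points, and the use of squared iteration standard deviation as the nugget); it is not the implementation used in the study.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def kriging_predict(x_train, y_train, nuggets, x_new, theta=10.0, variance=1e-4):
    """Zero-mean Kriging prediction with a per-point nugget added to the diagonal."""
    def cov(a, b):
        return variance * math.exp(-theta * (a - b) ** 2)
    k = [[cov(xi, xj) + (nuggets[i] if i == j else 0.0)
          for j, xj in enumerate(x_train)] for i, xi in enumerate(x_train)]
    weights = solve(k, y_train)
    return sum(w * cov(x_new, xi) for w, xi in zip(weights, x_train))

# Hypothetical data: the true response is zero, but one case reports a spurious
# value and also shows large iteration noise.
x_train = [0.0, 0.25, 0.5, 0.75, 1.0]
y_train = [0.0, 0.0, 0.03, 0.0, 0.0]
iteration_std = [0.0, 0.0, 0.02, 0.0, 0.0]

noiseless = kriging_predict(x_train, y_train, [0.0] * 5, 0.5)
noisy = kriging_predict(x_train, y_train, [s ** 2 for s in iteration_std], 0.5)
print(noiseless, noisy)
```

Without nuggets the model reproduces the spurious value exactly; with the nugget, the prediction at that location is pulled back toward zero. The size of that correction depends on how well large noise coincides with extreme values, which is consistent with the smaller benefit seen when the correlation in Table 6 is weak.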

7.6.3  Discrepancy  Between  Expectations  &  Observations 

In  light  of  observations  made  during  the  Reusable  Booster  System  project,  the  AoA  0°  flight 
condition  was  expected  to  exhibit  relatively  low  noise,  while  the  AoA  40°  flight  condition 
would  exhibit  around  an  order  of  magnitude  greater  noise.  The  data  obtained  during  this 
research  effort  showed  those  behaviors  to  be  exactly  reversed. 

Fortunately,  the  data  sets  still  presented  the  opportunity  to  test  the  methods  on  responses  with 
varying  degrees  of  iteration  noise,  and  the  results  did  in  fact  confirm  the  expectations: 
capturing  uncertainty  via  nuggets  had  greater  effect  when  the  degree  of  noise  in  the  response 
was  larger.  This  outcome  did  little  to  address  the  question:  why  did  the  observed  behavior  differ 
from  expectation? 

The changes in the way that Cart3D was operated were investigated to determine which, if any,
produced the unexpected solver behavior. First, most analyses during the Reusable Booster
project ran Cart3D in “robust” mode,[4] which forced the flow solver to calculate the solution
gradients at every stage of the Runge-Kutta iteration scheme, rather than just at the first stage.
Robust mode was used to address cases which were failing to run to completion. It was not
found to be necessary for the current analyses. Running cases in robust mode at Mach 4.0, α
0° and 40° did not produce a discernible change in iteration noise.

Secondly, the convergence tolerance defined in the aero.csh script (see Appendix D for more
details) was set to a more restrictive default value in the current version of Cart3D (1.4.7)
than in the version used for RBS tests (1.4.3). During the RBS tests, the functional error
tolerance was set to 0.005 by default, and this value was reduced to 0.001 after consultation with
the software developers.[135] In version 1.4.7, the default value is 0.000001, or $10^{-6}$, three orders
of magnitude smaller than the 0.001 used for the RBS tests. Cases run with the relaxed error
tolerance did not produce any discernible effect on the iteration noise.




The  expectation  of  iteration  noise  increasing  with  angle  of  attack  was  drawn  from  observations 
made  during  the  RBS  study.  Those  observations  were  made  at  Mach  0.9.  It  was  possible 
that  the  increasing  iteration  noise  was  specific  to  the  transonic  flight  regime.  When  the 
distribution  of  iteration  noise  with  respect  to  angle  of  attack  was  investigated  for  Mach  2.5  using 
the  RBS  data  set,  it  was  found  that  those  observations  matched  the  ones  described  in  this  chapter. 
Rather  than  being  a  general  characteristic,  it  would  appear  that  the  iteration  noise  was  strongly 
dependent  on  the  speed  regime  being  simulated. 

7.7  Observations  &  Conclusions 

The  yawing  moment  study  showed  that  the  use  of  nuggets  to  capture  uncertainty  resulted  in 
improved  prediction  accuracy  at  both  flight  conditions.  Specifically,  representation  of 
uncertainty  due  to  solver  iterations  was  found  to  significantly  improve  the  prediction  accuracy  of 
all  modeling  techniques  evaluated.  Representation  of  uncertainty  due  to  surrogate  model 
prediction  errors  improved  prediction  accuracy  for  some  modeling  techniques  but  not  others. 
The  degree  of  improvement  depended  on  the  degree  to  which  iteration  noise  corresponded  to 
extreme  results. 

Uncertainty  due  to  surrogate  prediction  error  was  estimated  when  the  low- fidelity  surrogate  was 
trained  and  tested,  which  means  that  this  information  was  effectively  free.  Including  this 
source  of  uncertainty  did  not  uniformly  improve  prediction  accuracy,  but  it  was  never  observed 
to  degrade  accuracy  either.  Because  the  information  cost  nothing  to  quantify  and  could  possibly 
improve  model  accuracy,  it  was  decided  that  uncertainty  due  to  solver  iteration  and  uncertainty 
due  to  surrogate  prediction  error  would  both  be  retained  for  use  in  future  uncertainty 
calculations. 

Hypothesis 3 asserted that:

When creating a Kriging model, the use of nuggets will capture uncertainty in the data,
improving predictive accuracy for noisy responses.

In light of the fact that models which capture response uncertainty using nuggets were shown
to have better prediction accuracy than those which did not, Hypothesis 3 could be
considered supported.

In a similar vein, Hypothesis 2 asserted that:

Data  fusion  techniques  will  allow  results  from  high-fidelity  analyses  to  be  augmented  with 
cheaper  sources  of  data  to  produce  surrogate  models  that  are  more  accurate  yet  require  less 
computationally-expensive  data. 

Experiments  in  this  chapter  demonstrated  that  the  use  of  data  fusion  techniques  for  multi-fidelity 
modeling  improved  prediction  accuracy  over  what  could  be  achieved  with  only  one  source  of 
data.  This  result  indicated  that  Hypothesis  2  could  also  be  considered  supported.  This 
concluded the evaluation of the lower-tier hypotheses, and cleared the way for the testing of the
main hypothesis in the next section.

8 Integrated Modeling & Sampling Procedure

The two preceding chapters illustrated how the selected methods produced the desired effects:
contour-based sampling improved prediction accuracy for a particular response range of interest
in Section 6; Section 7 illustrated that capturing uncertainty using nuggets resulted in better
accuracy for lateral responses, while leveraging cheaper data sources using Ghoreyshi
cokriging improved prediction accuracy for the longitudinal response.

Although  these  enhancements  were  individually  significant,  further  benefits  could  be  derived 
by  combining  the  techniques.  The  effectiveness  of  contour-based  sampling  depends  on  its  ability 
to accurately estimate response values at various points in the design space. Combining contour-based
sampling with multi-fidelity modeling would improve the accuracy of response estimates,
allowing  a  more  accurate  assessment  of  candidates.  This  was  expressed  by  the  primary 
hypothesis: 

By  placing  samples  intelligently,  reducing  dependence  on  the  expensive  models,  and  better 
quantifying  the  level  of  confidence  in  each  data  point,  the  selected  methods  will  reduce  the 
computational  expense  of  high-fidelity  modeling  to  sufficient  extent  that  it  becomes  a  feasible 
option  earlier  in  the  design  process. 

A  relatively  modest  demonstration  was  desired  to  determine  what  degree  of  improvement  the 
integrated  methods  could  offer.  The  nine-dimensional  design  space,  first  described  in  Section 
6.10,  was  chosen  for  this  demonstration  as  it  would  allow  the  results  obtained  to  be  compared 
against  those  from  both  space-filling  samples  and  single-fidelity  contour-based  sampling. 

8.1 Simplified Test: Nine Input Dimensions, Three Responses

8.1.1  Creating  an  Integrated  Algorithm 

The  scripts  and  functions  created  for  the  contour-based  sampling  exercise  in  Chapter  4  were 
augmented  to  use  Ghoreyshi  cokriging  in  place  of  single-fidelity  modeling.  The  generic  steps  to 
implement  the  multi-fidelity  sample-selection  method  will  be  given  in  boldface,  while  details 
about  the  author’s  implementation  of  the  method  will  be  given  in  plaintext. 

First,  the  source  of  low-fidelity  data  is  identified.  In  this  case,  APAS  was  used.  Secondly, 
the  user  must  decide  whether  this  data  source  can  be  applied  directly  (i.e.,  analyzing  each 
case  directly)  or  if  a  surrogate  model  is  necessary.  Due  to  the  number  of  evaluations 
required  by  contour-based  sampling  and  the  non-negligible  time  required  for  each  APAS 
solution,  surrogate  models  were  used  for  this  implementation. 

To generate these surrogate models, the 16,000-case nested Latin hypercube was analyzed at
each flight condition using APAS. BRAINN was then used to create neural networks for the
responses of interest (Cm at each flight condition), as described in Section 7.4. Not all
16,000 cases were used in training: 20 percent of the cases were used for validation and 15
percent were used as test cases, leaving 65 percent for training. The neural networks passed
all goodness-of-fit tests with excellent performance. The neural networks were then formatted
as Matlab functions which would take inputs, in the form of the geometric parameters which
made up the nine-dimensional design space, and return estimates for the response for each
input case. The functions were vectorized so that many cases could be assessed with a single
function call.

Next, whether the data source is applied directly or surrogates are used, the low-fidelity
response values are calculated for each high-fidelity training case. Once high- and low-fidelity
responses are available for each training case, multi-fidelity Kriging models are fit
to the training data, incorporating the low-fidelity responses as extra input dimensions as
described in Section 7.1.4.
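
The step of incorporating the low-fidelity response as an extra input dimension can be sketched as follows. The helper name, the two-parameter cases, and the linear stand-in for the low-fidelity surrogate are all hypothetical; in the study the low-fidelity values came from neural-network surrogates of APAS over nine geometric parameters.

```python
def augment_with_low_fidelity(cases, low_fidelity):
    """Append the low-fidelity response to each input vector as an extra dimension."""
    return [list(case) + [low_fidelity(case)] for case in cases]

# Hypothetical stand-in for a trained low-fidelity surrogate.
def low_fidelity(case):
    return 0.5 * case[0] - 0.2 * case[1]

training_cases = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
augmented_training = augment_with_low_fidelity(training_cases, low_fidelity)
# A Kriging model would now be fit to (augmented_training, high-fidelity responses).

candidates = [[0.2, 0.8]]
augmented_candidates = augment_with_low_fidelity(candidates, low_fidelity)
# The same augmentation must be applied to any point before it is predicted.
print(augmented_training)
print(augmented_candidates)
```

Because the model's input space now includes the low-fidelity response, every later prediction needs a low-fidelity estimate first, which is why a fast, vectorized surrogate is convenient.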

During implementation, the question arose whether it was more effective to fit models to $Y_{high}$,
the high-fidelity responses, or $(Y_{high} - Y_{low})$, the discrepancies between the high- and
low-fidelity responses. The latter option might be considered a hybrid of Ghoreyshi cokriging
and additive correction. When tested, it was found that the two methods produced predicted
response values that differed by a negligible amount. Given that the two alternatives were
indistinguishable with regard to predictions, it was decided to fit models to $Y_{high}$ directly,
since the alternative would require additional arithmetical operations for every model trained or
response estimated.

Just as in single-fidelity sample selection, the updated Kriging models are used to select the next
sample. The difference between the new algorithm and single-fidelity contour-based sampling
(described in Section 6.1) is that multi-fidelity adaptive sampling requires that both low- and
high-fidelity responses be estimated for all candidate and test points, because the multi-fidelity
Kriging models need the low-fidelity responses to estimate the high-fidelity responses.
Otherwise, the process of evaluating and selecting candidates remains the same.

8.1.2  Applying  the  Integrated  Algorithm 

When single-fidelity contour-based sampling was applied to the 9-dimensional problem in
Section 6.11.1, the 500-case level of a nested Latin hypercube was used as the initial data
set. This same data set was used to initialize multi-fidelity contour-based sampling to elucidate
the effects of the extra data source. Each sample was selected using a fresh set of candidate and
test points.

A tapering probability-of-interest (POI) requirement was used: the first five selection rounds
required POI values to exceed 25 percent, while the next five rounds required POI to be greater
than 15 percent. Rounds 11-35 required POI values above 5 percent, and rounds 36-70 simply
excluded any candidates with POI values equal to zero. This POI schedule was intended to
place early samples in regions that were expected to have very good performance while
allowing later samples to explore regions of lower prediction confidence. If no candidate met
the POI requirements, the candidate with the best POI value was selected as the next sample.
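
The tapering schedule and its fallback rule can be written compactly. The sketch below encodes only the thresholds stated above; the function names are hypothetical, and the contour-based criterion that ranks the eligible candidates is not shown.

```python
def poi_threshold(round_number):
    """Minimum probability of interest for a selection round (1-based)."""
    if not 1 <= round_number <= 70:
        raise ValueError("the schedule covers rounds 1 through 70")
    if round_number <= 5:
        return 0.25
    if round_number <= 10:
        return 0.15
    if round_number <= 35:
        return 0.05
    return 0.0

def eligible_candidates(pois, round_number):
    """Indices of candidates whose POI strictly exceeds the round's threshold.

    Falls back to the single best-POI candidate when none qualifies.
    """
    threshold = poi_threshold(round_number)
    keep = [i for i, p in enumerate(pois) if p > threshold]
    if not keep:
        keep = [max(range(len(pois)), key=lambda i: pois[i])]
    return keep

# Hypothetical POI values for three candidates.
pois = [0.30, 0.10, 0.0]
print(eligible_candidates(pois, 1))   # early round: only the strongest candidate passes
print(eligible_candidates(pois, 36))  # late round: any nonzero POI is eligible
```

Using a strict inequality means the final stage (threshold of zero) excludes exactly those candidates with a POI of zero, matching the description of rounds 36-70.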

Once 70 new cases had been selected, their surface meshes were built in PaceLab and their
aerodynamics  analyzed  using  Cart3D.  The  results  were  added  to  the  training  data  set. 

Kriging models were fit to each response based on the available data, and those models were
used to predict response values for the set of independent test data. As described in Section
4.10.7, these test cases were a set of 1,470 configurations within the nine-dimensional space that
produced pitching moment coefficients within ±0.1 at all three flight conditions (Mach 0.3, α
15°; Mach 0.8, α 0°; and Mach 2.5, α 0°). The prediction error for the Kriging models was
quantified using root mean squared error (RMSE).

8.1.3 Evaluation of Accuracy


Figure 39a & b show the results for Mach 0.3, α 15°. The black square icons represent the
models based only on nested Latin hypercube sampling for 500, 1,000 & 2,000 cases. Roughly 70
percent of the configurations analyzed produced well-converged results at all flight conditions.
Surrogate models based on these space-filling samples produce a smooth trend of improvement
as more samples are included. Together, these images demonstrate not only the performance
of the proposed approach compared to the baseline, but also the relative contributions of each
technique.


Figure 39: Prediction RMSE at Mach 0.3, α 15° (panels (a) and (b); horizontal axis: Cases Included)


In  contrast  to  the  single-fidelity  modeling  of  space-filling  samples,  the  results  from  applying 
single-fidelity  contour-based  sampling  (i.e.,  selecting  new  samples  based  only  on  Cart3D  data) 
are  represented  by  grey  triangles  in  Figure  39a.  These  results  were  previously  shown  in  Section 
6.10  on  page  83.  The  downward-pointing  triangles  represent  the  use  of  a  low  probability-of- 
interest  (POI)  requirement  for  the  sample  selection  algorithm,  which  allowed  it  to  explore  the 
design  space  more  freely.  This  series  is  labeled  “CBS  1”. 


As  a  review  from  when  these  results  were  first  presented,  the  initial  batch  of  samples  for  the 
low  POI  requirement  actually  worsened  the  predictive  accuracy  for  this  flight  condition, 
although  the  second  batch  almost  balanced  that  out.  Progress  was  somewhat  slow  but  steady  as 
the  model  learned  about  the  response  behavior.  For  this  flight  condition,  the  rate  of  improvement 
was  approximately  equal  to  space-filling  sampling. 


The upward-pointing triangles, labeled “CBS 2,” represent the use of a higher POI requirement,
which resulted in the selection of new cases that were fairly close to existing data points. This
series showed slow but steady improvement in predictive accuracy. The rate of improvement
was roughly equal to that of the later rounds of the low-POI series. However, because the
high-POI series did not suffer from early missteps, the overall performance of the high-POI
approach was better than the low-POI approach, at least for this particular response.

Figure 39b compares the baseline and the combined techniques against the use of data fusion
with space-filling samples. For the 500-case space-filling data set at this flight condition,
using multi-fidelity modeling negatively affected predictive accuracy, resulting in an RMSE
value (1.29) that was significantly larger than the RMSE value for the single-fidelity model of
the same data (0.704). The diamond icons show that data fusion, when applied to space-filling
samples, had mixed results for this response. It produced a mild improvement for the
1,000-case data set, a mild degradation for the 2,000-case set, and a sharp degradation for the
500-case set.


In  contrast,  the  black  circles  denote  models  based  on  the  full  proposed  method,  which  leveraged 
both  Ghoreyshi  cokriging  and  contour-based  sampling.  The  series  starts  with  the  same 
performance  as  the  space-filling  multi-fidelity  surrogate,  representing  the  performance  of  the 
combined  techniques  before  adaptive  sample  selection  began.  Because  the  combined 
techniques  were  initialized  using  the  multi-fidelity  surrogate  of  the  500-case  data  set,  the  same 
poor  initial  accuracy  was  observed.  Despite  this,  the  proposed  approach  showed  its  worth 
quickly,  and  more  than  made  up  for  the  initial  degradation  after  a  single  batch  of  adaptive 
samples.  Model  accuracy  continued  to  improve  with  further  sampling  until  the  fourth  and 
fifth  rounds.  Recall  that  in  each  round,  half  the  samples  were  selected  using  a  minimum  POI 
of  zero,  allowing  the  algorithm  to  select  cases  which  might  be  of  interest  to  any  response  rather 
than  only  those  expected  to  be  of  interest  to  all  responses.  It  is  possible  that  samples  were 
selected  in  the  fourth  and  fifth  rounds  that  were  in  regions  with  poor  performance  at  this  flight 
condition,  and  when  the  Kriging  model  was  updated  to  capture  those  samples,  its  ability  to 
model  the  actual  region(s)  of  interest  for  this  flight  condition  was  negatively  affected.  After 
one or two batches, the predictive accuracy once again began to improve at a rate faster than
that of the space-filling or single-fidelity approaches.

Figure 40: Prediction RMSE at Mach 0.8, α 0°


The RMSE results for Mach 0.8, α 0° are displayed in Figure 40a & b. The 1,000-case level of
the nested Latin hypercube (NLHC) appeared to contain misleading results for this flight
condition, based on the observation that the associated surrogate model was less accurate than
the one based on the smaller 500-case NLHC set. The 2,000-case set produced a marked
improvement over both smaller sets.

The two single-fidelity adaptive sampling approaches, marked as grey triangles in Figure 40a,
showed a gradual improvement as more samples were selected by the adaptive sampling
algorithm. This held true for both the high-POI-requirement series (denoted by upward-pointing
triangles) and the low-POI-requirement series (denoted by downward-pointing triangles). No
matter which POI approach was used, single-fidelity adaptive sampling out-performed space-filling
sampling for this response. In fact, the single-fidelity adaptive sampling approaches
out-performed the space-filling approach even when almost twice as many space-filling samples
were available.


The use of space-filling samples and multi-fidelity modeling, marked with grey diamonds in
Figure 40b, once again produced mixed results. The initial gain in predictive accuracy for
surrogates trained with the 500-case data set (from an RMSE of 0.61 for single-fidelity
modeling to 0.50 for multi-fidelity modeling) became a loss of accuracy for the 1,000-case
data set; there was minimal difference in predictive accuracy between single-fidelity and
multi-fidelity surrogates trained on the 2,000-case data set.

The black circles, on the other hand, denote the results of using the proposed approach,
multi-fidelity modeling and adaptive sampling. The first batch of samples selected by the
combined techniques produced a substantial improvement in prediction accuracy, reducing
RMSE from 0.50 to 0.34. Later batches continued to produce improvements, although none so
substantial as the first batch. Some evidence of diminishing returns was observed. Overall, the
proposed approach was very effective at improving surrogate model predictive accuracy for this
response, out-performing single-fidelity models based on either adaptive sampling or
space-filling sampling.

Figure 41: Prediction RMSE at Mach 2.5, α 0°

Lastly, Figure 41a & b depict the results at Mach 2.5, α 0°. The space-filling cases exhibited a
more-or-less linear trend of improving accuracy as the training data pool grew, although at a
more gradual rate than was observed for Mach 0.3.

The single-fidelity adaptive sampling approaches, denoted by grey triangular icons in
Figure 41a, had very different initial behavior. The series with the low POI requirement,
which was more tolerant of exploratory sampling, initially produced a sharp reduction in
prediction error, but lost some of those gains after a few rounds. Still, even after that reduction
in accuracy, this series out-performed both other single-fidelity approaches. The more
restrictive single-fidelity adaptive sampling approach, with the high POI requirement, initially
became less accurate but then improved rapidly. In fact, despite the initial misstep this series
out-performed the space-filling approach when both approaches had an equal quantity of data
available.

The grey triangles in Figure 41b illustrate the effects of space-filling samples and multi-fidelity
modeling, incorporating APAS data through the use of Ghoreyshi cokriging. A substantial
improvement was observed for all space-filling data sets, and in most cases prediction error
was roughly halved. For example, for the 500-case data set, RMSE was reduced from 0.81 for
single-fidelity modeling to 0.43 for multi-fidelity modeling of the same samples.

The black circles in both Figure 41a & b show the predictive performance for surrogates made
with both multi-fidelity modeling and contour-based sampling. Even though prediction error
had already been reduced substantially by the use of multi-fidelity modeling alone, the first
batch of multi-fidelity contour-based samples provided almost an equal improvement. This
brought the prediction RMSE below 25 percent of the original space-filling value. This result
suggests a relatively simple relationship between the behaviors of the low- and high-fidelity
responses. Later rounds also improved performance slightly, although for the most part the
predictive accuracy appeared to be steady.

Overall, the combination of multi-fidelity modeling and contour-based sampling produced
significant improvements in prediction accuracy for all three flight conditions. At Mach 2.5,
using only 210 samples, the integrated algorithm achieved a level of prediction accuracy that
would likely require thousands more space-filling cases to match. This appears to be strong
support for the hypothesis that these methods will enable the use of high-fidelity modeling
earlier in the design process than was previously possible.

This test problem, with its 9 free parameters, was still fairly simple compared to a typical design
problem. Before the primary hypothesis of this research could be confirmed, the method had to
be applied to a more challenging problem: the complete design space of the Reusable Booster
System study.

8.2 Full-Scale Test: Forty-Nine Input Dimensions, Twelve Responses

This experiment was intended to test the effectiveness of the proposed method for a
representative engineering problem. The design of a Reusable Booster System (RBS) was
difficult because of the relatively large number of input dimensions and the complexity of the
response. Such a vehicle must fly a very demanding trajectory, with a broad range of attitudes
and speed regimes.[71] Planned trajectories for such vehicles include angles of attack up to
and including 40°, much larger than for most air vehicles.[19, 25, 98] Pamadi et al.[144] showed
that a minimum of Euler CFD would be required for some of the flight conditions,
particularly the higher angles of attack where lower-fidelity analysis tools become less
accurate.
Higher-fidelity modeling such as Euler CFD or viscous simulation would entail significantly
increased computational effort per analysis. Additionally, the many input parameters made for
a large design space; if the response behavior was even moderately complex with respect to
the input parameters, understanding the response at a resolution adequate for decision-making
purposes could require a very large number of analyses. At the present time, creating an
accurate surrogate model for a large design space would require tens or hundreds of thousands
of analyses, with correspondingly daunting computational costs.

The wide range of flight conditions meant the analyst must simulate a potential configuration at
numerous conditions to evaluate its effectiveness - a configuration which has tolerable
aerodynamic moments at one flight condition might be unmanageable at another, and a useful
RBS design must be controllable over its entire return-to-launch-site trajectory. However, this
challenge held the seeds of success: by modeling only the design space regions which were
likely to be controllable for all flight conditions, the computational requirements might be
reduced significantly. Incorporating a cheaper source of data would further diminish the
number of high-fidelity analyses required. These observations led to the proposed method that
was the subject of this research.

8.2.1  Flight  Conditions  of  Interest 

The flight conditions for the full-scale experiment were selected based on likely trajectories.
The Fly-Back Booster designed by DLR[51, 98] was expected to fly a return-to-launch-site
(RTLS) trajectory from a reentry speed just under Mach 6, with an angle of attack schedule that
peaked at 35° at Mach 6 and fell below 10° at Mach 4. This vehicle featured a secondary turbine
propulsion system for powered subsonic flight, enabling higher staging speeds. At the other
end of the spectrum, the Langley Glide-Back Booster[26, 144] did not include a secondary
propulsion system. Instead, its reference trajectory featured an unpowered aerodynamic turn
and gliding return to the launch site. This trajectory placed an upper limit on the staging Mach
number: if the booster staged much faster than Mach 2, it would travel too far from the launch
site to glide back. The nominal trajectory featured a peak speed of Mach 2 at an angle of attack
of 48°, with the angle of attack falling rapidly to below 20° at speeds below Mach 1.4.

The present effort focuses on the “rocketback” RTLS maneuver,[79] in which the main
propulsion system of the reusable booster would be used to decelerate the vehicle after
staging. This extinguishes the horizontal velocity of the vehicle, limiting the downrange
distance traveled before the vehicle can perform an atmospheric turn and begin its unpowered
flight back to the launch site. This maneuver would also drastically reduce the heating
experienced by the vehicle during its descent.[19] Reference trajectories started with reentry at
angles of attack up to 40° and reached peak speeds between Mach 2.5 and Mach 3. The angle of
attack would then be reduced to roughly 10°, dipping close to 0° as the booster fell below the
speed of sound and then returning to 5-10°.

To  approximate  these  reference  trajectories,  four  flight  conditions  were  selected  from  the  RBS 
modeling  effort  described  in  Chapter  2: 

• Mach 2.5, α 40°, β 0°;
• Mach 2.5, α 15°, β 0°;
• Mach 0.8, α 0°, β 0°;
• Mach 0.3, α 15°, β 0°.

These flight conditions sketched out a rocketback RTLS trajectory, including each of the
various flow regimes such a vehicle would experience. Table 7 details the performance of the
neural networks produced during the RBS effort, each of which was trained using roughly
10,000 cases. The test R² values indicated that the neural networks were capturing almost all of
the variability observed in the data; this showed that the overall variability in the data set was
also large, supporting the conclusion that many of the configurations being analyzed had poor
aerodynamic qualities. Although the R² values for test data were high, which is one indication
of a good fit, the prediction error had a fairly large standard deviation.

These large standard deviation values indicated that any prediction made with these surrogates
would have significant uncertainty. Error bars could be used to quantify the magnitude of
that uncertainty. For example, the standard deviation for the pitching moment coefficient at
Mach 0.3, α 15° was 0.395; if a prediction were made with this neural network, a set of error
bars that would have 95 percent confidence of enclosing the actual response (as calculated by
Cart3D) would have to extend ±2σ, or ±0.79. This would be a very large range, especially
compared to the stated response range of interest (±0.1), and could make the difference between
a viable design and one that cannot be controlled. Given that range, it was unlikely that a
designer could use those surrogate models for design purposes with any confidence.
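
The error-bar arithmetic quoted above can be reproduced with a short sketch. It assumes zero-mean, normally distributed prediction errors, so that ±2σ approximates a 95 percent interval; the function and variable names are illustrative and do not come from the original study.

```python
def error_bar_half_width(sigma, k=2.0):
    """Half-width of a +/- k*sigma error bar.

    k = 2 approximates 95 percent coverage under an ASSUMED zero-mean
    normal distribution of prediction errors.
    """
    return k * sigma


sigma_cm = 0.395          # standard deviation quoted in the text (Mach 0.3, alpha 15 deg)
range_of_interest = 0.1   # response range of interest quoted in the text (+/- 0.1)

half_width = error_bar_half_width(sigma_cm)
ratio = half_width / range_of_interest

print(f"error bars: +/-{half_width:.2f}")
print(f"error bars are {ratio:.1f} times wider than the range of interest")
```

Under these assumptions the error bars are nearly eight times wider than the band of pitching moments that the designer is trying to resolve.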


Table  7:  Neural  Network  Prediction  Accuracy  for  Cm 
The full-scale experiment was therefore intended to determine whether the combined methods -
contour-based sampling, multi-fidelity modeling, and incorporating uncertainty - would offer
any improvement in performance relative to the previous results. The objective had changed
somewhat from what the RBS project attempted to do; whereas that effort attempted to
produce surrogate models which would be equally accurate throughout the design space, the
present objective was to be as accurate as possible for regions where moments were close to
zero, and sufficiently accurate in other regions that those regions could be identified as having
moments far from zero. Testing the effectiveness of each method would require a pool of test
cases with moments close to zero. These test cases first had to be identified.

8.2.2 Selecting Test Cases

Acceptable test cases for this experiment had to have good performance at all flight conditions,
representing a set of vehicle designs which were likely to be controllable along the
return-to-launch-site (RTLS) trajectory. The same genetic algorithm approach which was used to
identify test cases for the 9-dimensional problem was applied to the current set of 49 free
parameters. Although ultimately there were 12 responses of interest - the 3 aerodynamic
moments at each of the 4 flight conditions - it was expected that the pitching moment
coefficient at each flight condition would be the primary factor which determined whether a
given configuration would be feasible. Thus, although there were 12 responses to be modeled, it
was expected that test cases need only be selected on the basis of 4 of those responses: the
pitching moment coefficient at each flight condition.

All data points from the previous RBS study which were analyzed at all 4 flight conditions
relevant to the present study, and which met all the convergence criteria detailed in Appendix
D, were included in the initial data set for the genetic algorithm sampling. A total of 7,371 such
cases were found. As before, each variable was represented with an 8-bit string, effectively
transforming continuous variables into discrete variables with 256 possible settings. Each case
was mapped to the binary settings which most closely approximated its parameter values. A
full factorial sampling of the space at this resolution (256^49 combinations) would require
somewhat more than 10^118 cases.
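
As a rough check on the size of the discretized space, the sketch below counts the full factorial combinations and shows one possible way to map a continuous value to its nearest 8-bit setting. The evenly spaced grid that includes both range endpoints, and the `encode` helper itself, are assumptions made for illustration; the original encoding routine is not given in the text.

```python
import math

BITS_PER_VARIABLE = 8
NUM_VARIABLES = 49

settings_per_variable = 2 ** BITS_PER_VARIABLE               # 256 discrete settings
full_factorial_cases = settings_per_variable ** NUM_VARIABLES
order_of_magnitude = math.log10(full_factorial_cases)


def encode(value, lower, upper, bits=BITS_PER_VARIABLE):
    """Map a continuous value to the nearest setting on an evenly spaced grid.

    ASSUMPTION: the grid includes both endpoints of [lower, upper].
    Returns the setting as a bit string of length `bits`.
    """
    levels = 2 ** bits - 1
    index = round((value - lower) / (upper - lower) * levels)
    return format(index, f"0{bits}b")


print(settings_per_variable)
print(f"full factorial: about 10^{order_of_magnitude:.1f} cases")
print(encode(0.8, 0.5, 0.9))   # hypothetical parameter value and range
```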


The  fitness  function  used  was  the  same  as  for  the  9-dimensional  example: 
A record of every case ever analyzed, and the associated results, was kept. The best 500 cases
were selected and subjected to the genetic algorithm operators - tournament selection,
crossover, and mutation - to create new cases for the next population. As before, the crossover
rate was 70 percent and the mutation rate was 10 percent to encourage exploration of the
design space. The case-creation process would continue until 500 new cases, which did not
match any previous cases, had been generated. Those 500 new cases would then be analyzed
using Cart3D and the results added to the case records.
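
The batch-creation loop described above can be sketched as follows. This is a minimal illustration and not the study's implementation: the fitness function is a placeholder (the actual fitness function is not reproduced here), and the tournament size of 2, the single-point crossover, and the interpretation of the mutation rate as a per-individual probability of flipping one bit are all assumptions. The demonstration also uses a much smaller population than the 500 cases used in the study.

```python
import random

N_VARS, BITS = 49, 8                        # 49 parameters, 8 bits each
CROSSOVER_RATE, MUTATION_RATE = 0.70, 0.10  # rates quoted in the text


def tournament(pool, fitness, rng, size=2):
    """Return the fittest of `size` randomly drawn individuals (lower is better)."""
    return min(rng.sample(pool, size), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover, applied with probability CROSSOVER_RATE."""
    if rng.random() < CROSSOVER_RATE:
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:]
    return a


def mutate(bits, rng):
    """With probability MUTATION_RATE, flip one randomly chosen bit."""
    if rng.random() < MUTATION_RATE:
        i = rng.randrange(len(bits))
        bits = bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]
    return bits


def next_batch(best, fitness, seen, batch_size, rng):
    """Create `batch_size` new bit strings that match no previously analyzed case."""
    batch = set()
    while len(batch) < batch_size:
        parent_a = tournament(best, fitness, rng)
        parent_b = tournament(best, fitness, rng)
        child = mutate(crossover(parent_a, parent_b, rng), rng)
        if child not in seen:      # reject anything already in the case records
            batch.add(child)
    return sorted(batch)


def placeholder_fitness(bits):
    """Stand-in objective only; NOT the fitness function used in the study."""
    return bits.count("1")


rng = random.Random(0)
population = ["".join(rng.choice("01") for _ in range(N_VARS * BITS))
              for _ in range(50)]
seen = set(population)                                    # record of analyzed cases
best = sorted(population, key=placeholder_fitness)[:20]   # best cases so far
new_cases = next_batch(best, placeholder_fitness, seen, batch_size=10, rng=rng)
seen.update(new_cases)                                    # results would be added after analysis
print(len(new_cases), "new cases created")
```

Keeping the record of analyzed cases in a set makes the "does not match any previous case" check inexpensive even after tens of thousands of evaluations.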

Forty iterations of this process were performed, requiring the evaluation of 20,000 cases at each
of 4 flight conditions. Of the cases analyzed, 2,370 (roughly 12 percent) had pitching moment
coefficients within ±0.1 for all 4 flight conditions. The algorithm took quite a few rounds to
find useful test cases, but as more good cases were identified the rate increased. No test cases
were found in the first 13 batches (6,500 cases); the final 5 rounds averaged just over 185
new test cases per batch of 500, or 37 percent. The search for test cases was curtailed after 40
batches because the 2,370 cases available at that point were felt to be sufficient.

The accumulated test cases were investigated using JMP as a sanity check. Scatterplot
matrices, which are a method of visually identifying trends and patterns in a data set, were
generated for all 49 variables to see what deductions could be drawn from the distribution of
test cases. A partial scatterplot matrix showing 4 variables is given in Figure 42.

A scatterplot matrix is a set of 2-variable distributions. For example, the uppermost block in
Figure 42 displays cases with the Fuselage Loft End value as the abscissa or horizontal
component and Nose Droop as the ordinate or vertical component. Each black dot in this block
represents a single configuration. Each configuration is plotted in every block. The distribution
of dots can reveal trends in the data, such as regions with unusually dense or sparse sampling.

For example, for cases in the test set - i.e., cases which were found to have small pitching
moments at all flight conditions - it was likely that the Nose Droop value was at the high end of
the range: 1,908 of the 2,370 test cases had Droop values of 0.8 or above, which indicates that
for most of these test configurations the tip of the nose was close to the bottom of the vehicle.
Likewise, most of the selected test cases had small Wing Span Fraction values and large Wing
Root Chord Fraction values. In contrast, there was no trend visible with respect to the Fuselage
Loft End parameter.


[Scatterplot matrix. Axis labels shown: Fuselage - Loft End, Nose - Droop, Wing - Span]

Figure 42: Partial Scatterplot for Selected Cases

Because genetic algorithms were used to select these test cases, the cases were not independent,
so it was possible that clustering behavior might appear that stemmed from the way cases
were selected rather than the underlying behavior of the response. To determine whether such
spurious clustering was likely to be a problem in this case, another scatterplot matrix was
generated using the full set of 20,000 cases which were evaluated during the genetic algorithm
search. A partial scatterplot matrix showing the same 4 variables is given in Figure 43. Each
black dot represents a single configuration; each configuration is represented in every box.

In Figure 43, the design space is shown to be sampled fairly thoroughly. No regions were left
unsampled, although some regions were sampled more thoroughly than others. For example, in
the lower-left scatterplot, it can be seen that there were more cases with high Wing Root Chord
Fraction values than with low Wing Root Chord Fraction values. This is indicated by the dense
distribution of black dots in the upper region of the scatterplot, with almost no white space
visible, and a mild increase in white space visible in the lower region of the scatterplot.

The thorough sampling observed in Figure 43 suggested that the evaluated cases covered the
design space well, and thus that inferences about the design space could be safely drawn from
the distribution of test cases. Parameters which tended to take high values included Nose Top
Curvature 2, Nose Bottom Curvature 2, Nose Droop, Wing Root Chord Fraction, Wing Outboard
Taper Ratio, Wing Dihedral, Wing Maximum Camber Location, Wing Airfoil Thickness-to-Chord
Ratio, and Wing Leading Edge Radius Parameter. Parameters which tended to take low values
included Vehicle Scale, Wing Span, Wing Twist, Wing Incidence, Wing Camber, Wing Maximum
Thickness Location, and Vertical Tail Leading Edge Sweep.



Approved  for  public  release;  distribution  unlimited 


[Scatterplot matrix. Axis labels shown: Fuselage - Loft End, Nose - Droop, Wing - Span]

Figure 43: Partial Scatterplot for All Evaluated Cases

Curiously, there were two strong clusters of cases observed for Nose Spatularity Ratio; one
group had values at or near the minimum of the range, while another cluster had values near
the midpoint of the range. In JMP, when points are selected within one scatterplot those
points are highlighted in all other scatterplots, allowing the user to see the distribution of
those points with regard to other parameters without having to generate a new set of plots. When
the cases with larger Spatularity values were highlighted, it was revealed that these cases
almost exclusively had very low Nose Fineness Ratios. Conversely, cases with small
Spatularity values tended to have higher Nose Fineness Ratios, although some cases with both
small Spatularity values and small Fineness Ratios were observed as well.

8.2.3  Generation  of  Low-Fidelity  Data 

Before sampling began, a decision had to be made as to whether APAS could be used
directly as the low-fidelity data source, or if it had to be replaced with a surrogate model.

When multi-fidelity methods are used, contour-based sampling requires such a large quantity of
low-fidelity estimates that supplementary surrogate modeling could be a necessity. As in the
9-dimensional example given earlier, a large space-filling set of cases was analyzed with APAS
and a neural network was fit to the results. Because the per-evaluation cost of APAS was so low,
a very large space-filling data set was generated. However, it was uncertain whether all of the
data would be necessary. Furthermore, the portion of the data that might be used was unlikely to
be a regular fraction of the whole. To account for this, a sampling approach was sought such
that even arbitrary fractions of the sample set would still have good space-filling characteristics.
This search led to the use of Sobol sequences.

To allow a subset of the data to be used without sacrificing the space-filling qualities, the
Matlab function sobolset from the Statistics Toolbox was used to generate data using a Sobol
sequence. A Sobol sequence is a series of quasi-random numbers that have good space-filling
properties.[21, 114, 180] The sequence can be generated for relatively little computational effort
and subsets of the sequence also have relatively good space-filling properties.


Figure 44 shows how a Sobol sequence can be sampled sequentially without loss of space-filling
characteristics. Figure 44a depicts the first 100 samples from the sequence. Figure 44b plots
the first 200 samples, and Figure 44c plots the first 300 samples. Thus, samples from a Sobol
sequence can be added progressively.
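
The progressive property can be illustrated with a small, dependency-free sketch of an unscrambled two-dimensional Sobol generator. This is not the Matlab sobolset implementation: it uses the standard Gray-code construction with the usual first two sets of direction numbers, applies no scrambling, and defines `leap` as the number of points passed over between retained points. All names in the block are illustrative.

```python
def sobol_2d(n, skip=0, leap=0, bits=32):
    """Return n points of the unscrambled 2-D Sobol sequence (Gray-code order).

    skip: number of initial points discarded.
    leap: number of points passed over between retained points (0 keeps all).
    """
    # Dimension 1 is the base-2 van der Corput sequence. Dimension 2 uses the
    # primitive polynomial x + 1 with m_1 = 1, giving m = 1, 3, 5, 15, 17, ...
    m = [1]
    for _ in range(bits - 1):
        m.append(m[-1] ^ (2 * m[-1]))
    v1 = [1 << (bits - 1 - i) for i in range(bits)]
    v2 = [m[i] << (bits - 1 - i) for i in range(bits)]
    scale = float(1 << bits)

    points, x1, x2, index = [], 0, 0, 0
    while len(points) < n:
        if index >= skip and (index - skip) % (leap + 1) == 0:
            points.append((x1 / scale, x2 / scale))
        # Gray-code update: XOR with the direction number that corresponds
        # to the lowest zero bit of the current index.
        c = 0
        while (index >> c) & 1:
            c += 1
        x1 ^= v1[c]
        x2 ^= v2[c]
        index += 1
    return points


first_100 = sobol_2d(100)
first_300 = sobol_2d(300)
# Adding samples never changes the earlier ones: the first 100 points of the
# 300-point set are exactly the 100-point set.
print(first_300[:100] == first_100)

# For this unscrambled generator, retaining every 4th point confines the
# first coordinate to the left half of the space.
leaped = sobol_2d(200, leap=3)
print(max(x for x, _ in leaped))
```

The second check shows, for this particular unscrambled generator, how a poorly chosen leap value can leave part of the horizontal range unsampled. Scrambled sequences such as those produced by Matlab will not fail in exactly this way.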


Figure 44: Subsets of a Sobol Sequence

Note, however, that in all three sample sets, a gap is left unsampled near (0.6, 0.3) and the
upper region is somewhat more heavily sampled than the lower. These results demonstrate that
the sobolset function requires some degree of oversight by the user. In addition to the number of
dimensions desired, the user can input values for two free parameters (“skip” and “leap”)
which control which numbers in the sequence are used to create samples. By varying these free
parameters, new sequences can be generated, but these sequences sometimes have poor
space-filling characteristics. A few sample Sobol sequences were generated using the commands:

p = sobolset(2,'skip',skip,'leap',leap);   % 2-D Sobol point set
p = scramble(p,'MatousekAffineOwen');      % apply scrambling
POINTS = net(p,500);                       % first 500 points of the set

Here, skip and leap are user-defined Matlab variables. Once Matlab has generated the sequence,
it will skip the first skip points in the sequence and then use every leap-th point as a sample.[121]

Three  Sobol  sequences  are  demonstrated  in  Figure  45.  Figure  45a  uses  a  skip  value  of  66  and  a 
leap  value  of  32  to  select  500  points;  these  are  the  same  parameters  that  were  used  to 
generate  the  sample  sets  in  Figure  44.  Figure  45b  is  based  on  a  skip  value  of  36  and  a  leap 
value  of  16.  The  samples  from  this  sequence  exhibited  significant  clumping,  and  the  lower 
edge  of  the  space  was  sampled  heavily  while  other  regions  were  neglected.  Figure  45c  is  based 
on  a  skip  value  of  42  and  a  leap  value  of  29.  Once  again,  clumping  was  observed;  in  this 
example,  however,  the  samples  did  not  address  the  full  horizontal  range  of  the  space. 

Figure 45: Distribution of Points from Sobol Sequences


These observations are why Sobol sequences were not selected for most space-filling sampling
in this research; there was too much risk that the Sobol sequence that was generated would have
poor space-filling characteristics, which would adversely affect the experiment.

However, this risk was considered acceptable for generating training data from APAS, because
the per-analysis cost was low and only a portion of the total samples would be used. Checks were
added to ensure that the sequence sampled the full range of each parameter.
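
One simple form such a range check could take is sketched below. The original checks are not described in detail, so the binning scheme, the unit-interval scaling, and the function name are assumptions for illustration only.

```python
def unsampled_bins(points, n_bins=10):
    """For each dimension, list the bins of [0, 1) that contain no sample.

    ASSUMPTION: all parameters have been scaled to the unit interval.
    """
    n_dims = len(points[0])
    gaps = []
    for d in range(n_dims):
        hit = {min(int(p[d] * n_bins), n_bins - 1) for p in points}
        gaps.append([b for b in range(n_bins) if b not in hit])
    return gaps


# A small, well-spread 2-D sample set built for illustration.
good = [((i + 0.5) / 20, ((7 * i) % 20 + 0.5) / 20) for i in range(20)]
# The same samples squeezed into the left half of the first dimension,
# mimicking a sequence that fails to cover the full horizontal range.
clipped = [(0.5 * x, y) for x, y in good]

print(unsampled_bins(good))      # no empty bins in either dimension
print(unsampled_bins(clipped))   # right half of dimension 0 is empty
```

A sequence would be accepted only if every list of empty bins is empty. A coarse bin count keeps the check meaningful even when only a fraction of the sequence is used.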

After the Sobol sequence was created, APAS was used to analyze the points. The results were
then modeled using the neural network tool BRAINN. At first, the tool was applied to the
full set of 500,000 cases, which made the fitting process slow. It was found that networks
trained on 100,000 cases were equally accurate and could be trained much more quickly.

Because the APAS model would not include control surface deflections and only zero-sideslip
flight conditions were being simulated, it was expected that once again any lateral responses
estimated by APAS would be spurious. No surrogate models were trained to reproduce lateral
responses from APAS. Instead, the primary goal of the low-fidelity surrogate models was to
accurately predict pitching moment. A partial list of the goodness-of-fit metrics for the Cm
models at each flight condition is given in Table 8. These metrics included the distribution of
the Model Fit Error (MFE), which measured how well the model fit the training data, and the
distribution of the Model Representation Error (MRE), which measured how well the model fit
new data points that were not used in training. These error distributions were described by the
mean (μ) and standard deviation (σ) of the observed prediction errors.
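
The metrics described above can be computed as in the following sketch. The sign convention (predicted minus actual), the use of the population standard deviation, and the toy numbers are assumptions for illustration; they are not taken from the BRAINN tool or from the APAS data.

```python
from statistics import fmean, pstdev


def error_distribution(predicted, actual):
    """Mean and standard deviation of the prediction errors.

    Applied to training data this gives the MFE distribution; applied to
    held-out test data it gives the MRE distribution.
    ASSUMPTION: error = predicted - actual, population standard deviation.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    return fmean(errors), pstdev(errors)


def r_squared(predicted, actual):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_actual = fmean(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values, made up purely to exercise the functions.
train_actual = [0.0, 1.0, 2.0, 3.0]
train_predicted = [0.1, 0.9, 2.1, 2.9]

mfe_mean, mfe_std = error_distribution(train_predicted, train_actual)
r2_train = r_squared(train_predicted, train_actual)
print(f"MFE mean = {mfe_mean:.3f}, MFE std = {mfe_std:.3f}, R2 = {r2_train:.3f}")
```

A mean error near zero with a large standard deviation, as in the subsonic rows of Table 8, indicates a model that is unbiased on average but imprecise for any individual prediction.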

The two supersonic surrogates had good performance for the most part. Both the training and
test R² values were very close to 1. The Model Fit Error, which quantified how well the
surrogate model reproduced its training data, had a mean close to zero and a standard deviation
that was not overly large, if a bit larger than might be desired. The same was true for the Model
Representation Error, which quantified how well the surrogate reproduced a separate set of test
data which was not used in its training.

Table 8: Goodness of Fit Metrics for Surrogates of APAS Data

Flight condition    R² Training   R² Test   MFE μ        MFE σ   MRE μ     MRE σ
Mach 0.3, α 15°     0.927         0.844     -2.1x10^-?   0.581   0.0160    0.901
Mach 0.8, α 0°      0.946         0.812     3.2x10^-?    0.317   0.00846   0.648
Mach 2.5, α 15°     0.997         0.995     1.6x10^-?    0.114   0.00130   0.140
Mach 2.5, α 40°     0.998         0.998     5.8x10^-?    0.242   0.00338   0.255

The subsonic surrogates were not quite as accurate. The test R² values were below 0.9, which
may be cause for concern, and the standard deviations of both the Model Fit Error and
Model Representation Error were significantly larger. It was decided to proceed with these
surrogates rather than trying to improve the models; this would serve as a test to determine the
sensitivity of the method with respect to the accuracy of the low-fidelity data. In effect, a
less-accurate source of data was being used for the two subsonic conditions.

Once  the  low-fidelity  surrogates  were  completed,  the  main  effort  of  this  experiment  could  begin 
in  earnest. 

8.2.4  Null  Hypothesis:  Space-Filling  Samples 

The  null  hypothesis  for  this  experiment  was  that  space-filling  sampling  would  result  in  the 
most  accurate  surrogate  models  for  the  given  test  points.  Space-filling  sampling  in  general, 
and  Latin  hypercube  sampling  in  particular,  has  been  known  to  be  an  effective  approach  for 
understanding  and  modeling  response  behavior.  [126]  Such  sampling  is  particularly  useful 
when,  as  Cioppa  and  Lucas  put  it,  “there  may  be  multiple  responses  of  interest  and  little  a 
priori  knowledge  about  the  forms  that  the  response  function  may  take.”[30] 

Space-filling methods usually select all samples simultaneously, forcing the user to decide in
advance how many samples would be necessary. Qian’s method of nested Latin
hypercubes[152] allows some flexibility in this regard; a nested Latin hypercube (NLHC)
contains multiple space-filling subsets, referred to as levels, which give the user a variety of
potential sample set sizes without losing the space-filling quality. However, the number of cases
in each space-filling subset grows at a geometric rate, with the smallest possible growth rate
being a doubling of size at each level. If more than a handful of levels are required, the rate of
growth can quickly become significant. This was highlighted during the initial RBS effort:
when the 8,000-case subset was insufficient with regard to model accuracy, the researchers had
to run another 8,000 cases to produce another space-filling data set, doubling the computational
effort.

Consider the nested Latin hypercube results depicted in Figure 39, Figure 40 & Figure 41. The
plotted results include sample sets of 500, 1,000 & 2,000 cases; the next level of that NLHC
would be 4,000 cases, a significant increase compared to the number of cases in the
competing data sets. It would be preferable if the step size between space-filling sets was
more uniform, since this would allow the user a greater degree of granularity with respect to
sampling size.

Qian  has  also  proposed  Sliced  Latin  Hypercube  designs,[153]  which  are  intended  for  problems 
with  one  or  more  discrete  variables.  For  the  purpose  of  clarity,  the  following  discussion  will 
assume  that  only  one  discrete  variable  is  present. 

For  each  possible  setting  of  the  discrete  variable,  one  hypercube  is  generated.  This  hypercube, 
referred  to  as  a  “slice,”  has  good  space-filling  qualities  with  respect  to  the  continuous 
variables.  When  all  slices  are  combined,  the  resulting  sample  design  is  also  a  true  hypercube, 
with  good  stratification  over  every  dimension. 

To  illustrate  the  concept,  consider  sampling  along  a  single  continuous  variable  with  4  slices, 
each  slice  consisting  of  4  samples.  Each  input  variable  would  be  split  into  16  equal  “bins”. 

The  combined  hypercube  would  place  1  sample  in  each  bin.  The  bins  would  be  grouped  into 
sets  of  4,  and  each  slice  would  put  a  single  sample  in  each  group. 

For example, one group would consist of bins 1, 2, 3 & 4; another group would consist of bins
5, 6, 7 & 8, etc. The first slice (i.e., the first hypercube in the set) would include one bin from
each group, such as (3, 5, 11, 13). The second slice would also include one bin from each group,
such as (1, 6, 9, 16). No bin would appear in more than one slice, and each slice would include
only one bin per group.
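
The bin assignment in this single-variable illustration can be sketched as follows. This is a simplified construction written to match the example above, not Qian's full algorithm for multiple dimensions; the function name and the random seed are arbitrary.

```python
import random


def sliced_bins(n_slices, samples_per_slice, rng):
    """Assign bins 1..n_slices*samples_per_slice to slices.

    The bins are grouped into `samples_per_slice` consecutive groups of
    `n_slices` bins. Each group is shuffled and dealt out so that every
    slice receives exactly one bin from every group.
    """
    slices = [[] for _ in range(n_slices)]
    for g in range(samples_per_slice):
        group = list(range(g * n_slices + 1, (g + 1) * n_slices + 1))
        rng.shuffle(group)
        for s, bin_index in enumerate(group):
            slices[s].append(bin_index)
    return slices


rng = random.Random(1)
slices = sliced_bins(n_slices=4, samples_per_slice=4, rng=rng)
for k, bins in enumerate(slices, start=1):
    print(f"slice {k}: bins {bins}")

# One sample is then drawn uniformly inside each assigned bin of [0, 1).
n_bins = 16
samples = [[(b - 1 + rng.random()) / n_bins for b in bins] for bins in slices]
```

Each slice is a 4-point design that covers all four groups, while the union of the slices places exactly one sample in each of the 16 bins.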

After  the  algorithm  to  create  sliced  Latin  hypercubes  was  implemented,  a  number  of  two- 
dimensional  sliced  Latin  hypercubes  were  generated  and  investigated  visually,  with  regard  to 
both  the  distribution  of  points  in  each  slice  and  the  distribution  of  points  in  the  combined  set. 
Based  on  this  admittedly  qualitative  examination,  it  was  observed  that  some  hypercubes 
exhibited  a  tendency  toward  clumping,  with  some  regions  being  sampled  more  densely  than 
others. 

To  reduce  the  risk  of  clumping  and  sample  the  space  more  evenly,  a  variation  of  sliced  Latin 
hypercubes  was  devised.  This  variation  would  combine  independently  generated  hypercubes 
rather  than  permutations  of  a  common  core.  Because  the  overall  sample  set  was  composed  of 
multiple  unrelated  hypercubes,  it  was  referred  to  as  a  “stacked  Latin  hypercube.” 

By  now,  the  reader  should  be  aware  that,  due  to  the  computational  resources  available  to  this 
effort,  there  was  low  emphasis  placed  on  finding  elegant  solutions.  It  was  likely  that,  by  the  time  a 
complex  approach  could  be  implemented,  a  simpler  brute-force  method  might  already  have 
accomplished  the  task  at  hand.  This  mindset  informed  the  method  of  generating  stacked  Latin 
hypercubes. 

To  create  a  stacked  Latin  hypercube,  the  user  first  determines  the  desired  step  size  and  total 
number of cases. The total number of cases should be an integer multiple of the step size. For
this effort, it was decided that a step size of 500 cases would be a good compromise between
granularity and simplicity; the total number of cases was set to 8,000, which was expected to
be close to or in excess of the largest viable data set for Kriging models. Thus, 16 hypercubes,
each with 500 cases, would be combined to create the overall set of 8,000 cases. The challenge
was to select a set of 16 hypercubes such that the combination of all cases was well-spaced.

A  pool  of  800  Latin  hypercubes  was  generated,  each  of  which  included  500  cases  of  49 
dimensions. This pool size might seem small to researchers more accustomed to Monte Carlo
analyses using tens of thousands of cases, but such instincts can be misleading. The pool was
set at 800 hypercubes because that value offered 50 hypercubes for each 1 being selected.

A selection of k members from a pool of n options, when the order of selection is unimportant, is
known in probability theory as a “combination.”[76] The number of possible combinations,
given n and k, can be calculated by:

$$C^n_k = \frac{n!}{(n-k)!\,k!} \tag{36}$$

Here, ! indicates the factorial of the number, calculated as:

$$n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 \tag{37}$$

By convention, 0! is equal to 1. The number of possible combinations can grow much faster
than might be expected. For example, if the objective was to select 3 options from a pool of 10,
the number of possible combinations would be $\frac{10!}{3! \times 7!}$, which is equal to
$\frac{3{,}628{,}800}{6 \times 5{,}040}$, or 120 combinations. Due to the nature of factorials, the
number of combinations increases very rapidly with the size of the pool.


Although the direct calculation of 800! in common programs like Microsoft Excel 2007 or
Matlab R2010b returns an answer of infinity (factorials grow very rapidly, and 100! is already
roughly $9.3 \times 10^{157}$), inspection of Equation 37 shows that when n is much larger than k, many
of the terms in n! and (n-k)! will cancel out, resulting in a numerator and denominator that
are much less elegant but far more computationally tractable:

$$C^{800}_{16} = \frac{800!}{784! \times 16!} = \frac{785 \times 786 \times \cdots \times 799 \times 800}{16!} \approx 1.16 \times 10^{33} \tag{38}$$


Thus, even a pool of 800 hypercubes will offer a very large number of possible combinations.
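
As a quick check of this cancellation, the count can be evaluated with exact integer arithmetic. The sketch below is illustrative only; the helper name `count_combinations` is an assumption, and Python's arbitrary-precision integers stand in for the tools named above.

```python
import math

def count_combinations(n: int, k: int) -> int:
    """Number of k-member combinations from n options, using the
    cancelled form (n-k+1) * ... * n / k! rather than n! directly."""
    numerator = math.prod(range(n - k + 1, n + 1))
    return numerator // math.factorial(k)  # division is always exact here

if __name__ == "__main__":
    print(count_combinations(10, 3))               # the 3-from-10 example
    print(f"{count_combinations(800, 16):.3e}")    # 16 hypercubes from 800
```

The cancelled product never forms 800! itself, which is why the same calculation remains tractable in floating-point environments where the direct factorial overflows.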

Genetic algorithms had been used for other portions of this research effort. The generation of a
stacked Latin hypercube lent itself directly to the use of genetic algorithms. The “chromosome”
for this problem consists of a binary string 800 bits long, with one bit for each hypercube in the
pool. A 1 would indicate that that hypercube was included, while a 0 would indicate that the
hypercube was excluded. The bits of each chromosome were constrained to sum to 16, i.e.,
there must be 16 hypercubes for each population member so that the total number of points in
the stacked Latin hypercube would be 8,000.

Genetic algorithms are well-suited for problems which are discontinuous, highly-dimensional,
noisy and/or multimodal. The creation of a stacked Latin hypercube was known to be highly-
dimensional, since there were 800 dimensions, and discontinuous: each member of the pool of
hypercubes had to be either included in its entirety or excluded, and could not be partially
included.

The  population  size  for  this  optimization  was  500  members,  with  a  probability  of  cross-over 
of  70  percent  and  a  probability  of  mutation  of  5  percent.  Rather  than  recording  the 
performance  of  every  population  member  ever  analyzed,  the  10  best  members  of  each 
population  were  carried  over  to  the  following  population  unchanged.  Tournament  selection, 
crossover,  and  mutation  operations  were  then  used  to  create  the  other  490  population  members. 
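
One generation of such a genetic algorithm might be sketched as follows. The sketch is not the implementation used in this effort: it stores each chromosome as the set of 16 selected pool indices (equivalent to an 800-bit string with 16 ones), and the tournament size, the union-based crossover, the swap mutation, and the application of the mutation probability per member are all assumptions chosen so that every child automatically satisfies the 16-hypercube constraint.

```python
import random

POOL, PICK = 800, 16           # pool size and hypercubes per member
ELITE = 10                     # best members carried over unchanged
P_CROSS, P_MUT = 0.70, 0.05    # crossover and mutation probabilities

def random_member(rng):
    """A member is the set of PICK pool indices whose bit would be 1."""
    return frozenset(rng.sample(range(POOL), PICK))

def tournament(pop, scores, rng, size=2):
    """Fittest of `size` randomly drawn members (size is an assumption)."""
    contenders = rng.sample(range(len(pop)), size)
    return pop[max(contenders, key=scores.__getitem__)]

def crossover(a, b, rng):
    """Draw the child from the parents' union, keeping exactly PICK genes."""
    return frozenset(rng.sample(sorted(a | b), PICK))

def mutate(member, rng):
    """Swap one included hypercube for one excluded hypercube."""
    removed = rng.choice(sorted(member))
    added = rng.choice([i for i in range(POOL) if i not in member])
    return (member - {removed}) | {added}

def next_generation(pop, fitness, rng):
    """Elitism plus tournament selection, crossover and mutation."""
    scores = [fitness(m) for m in pop]
    ranked = sorted(range(len(pop)), key=scores.__getitem__, reverse=True)
    new_pop = [pop[i] for i in ranked[:ELITE]]
    while len(new_pop) < len(pop):
        child = tournament(pop, scores, rng)
        if rng.random() < P_CROSS:
            child = crossover(child, tournament(pop, scores, rng), rng)
        if rng.random() < P_MUT:
            child = mutate(child, rng)
        new_pop.append(child)
    return new_pop
```

A run would start from 500 calls to `random_member` and apply `next_generation` repeatedly, with `fitness` returning the spacing score of the cases contained in the selected hypercubes.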

The  fitness  function  for  the  GA  was  the  Euclidean  maximin  distance,  which  is  the  smallest 
Euclidean  distance  between  any  two  points  in  the  set.  This  distance  is  a  common  metric  to 
assess how well-spaced a set of cases is,[30, 132, 147, 185] and in fact the function lhsdesign
from the Matlab Statistics Toolbox, which generates Latin hypercubes, attempts to maximize this
metric.  [120] 

The Euclidean distance between two points $x_1$ and $x_2$ in $E$ dimensions is calculated as follows:

$$d(x_1, x_2) = \sqrt{\sum_{i=1}^{E} \left(x_{1,i} - x_{2,i}\right)^2} \tag{39}$$

Here, $x_{j,i}$ is the $i$th component of $x_j$.[132]

Each  member  of  the  population  was  a  particular  combination  of  16  hypercubes  from  the  pool. 
The  members  were  evaluated  with  the  Euclidean  maximin  distance;  the  larger  this  value,  the 
farther  apart  the  two  closest  points  in  that  set  of  cases,  and  the  better  those  cases  were  spread  out 
throughout  the  space. 
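
The metric itself is simple to state in code. The following is a minimal sketch using only the Python standard library; the function name `maximin_distance` is an assumption, and the plain pairwise loop is meant to show the definition rather than to be fast enough for 8,000 cases in 49 dimensions.

```python
import itertools
import math

def maximin_distance(points):
    """Smallest Euclidean distance between any two points in the set.

    A larger value means the two closest points are farther apart,
    i.e. the set is better spread out.
    """
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))

if __name__ == "__main__":
    cases = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
    print(maximin_distance(cases))  # the closest pair is (0,0) and (0,1)
```

Because every pair of points must be compared, the cost of one evaluation grows with the square of the number of cases, which contributes to the long run times reported for the optimization.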

After  100  iterations,  the  best  stacked  hypercube  had  a  maximin  distance  of  1.68.  To  evaluate 
this  result,  a  comparison  was  made  using  Latin  hypercubes.  A  set  of  50  Latin  hypercubes, 
each of which had 49 dimensions and 8,000 cases, was generated using Matlab’s lhsdesign
function  and  evaluated  using  the  maximin  metric.  The  average  maximin  distance  for  these 
hypercubes  was  1.56  and  the  best  hypercube  had  a  maximin  score  of  1.61.  These  results 
showed  that  the  stacked  Latin  hypercube  approach  could  produce  a  combined  set  with  better 
space-filling  characteristics  than  a  single  Latin  hypercube  designed  to  have  good  maximin 
spacing.  In  light  of  this  performance,  the  genetic  algorithm  was  curtailed  at  that  point.  The 
entire  optimization  took  roughly  17  hours  on  a  shared  computing  resource  with  four  2.66  GHz 
processors  and  4  gigabytes  of  RAM,  although  there  is  no  way  to  determine  whether  any  other 
users  were  accessing  the  resource  at  the  time. 

Good performance was demonstrated for the full stacked set, but as yet there was no evidence
that a subset of the full stacked set would also have desirable space-filling characteristics. A
sorting algorithm was developed which took the 16 hypercubes that make up the stacked Latin
hypercube and determined the best order in which to use them to maximize the maximin
distance at each level. Thus, the first hypercube would be the one with the largest maximin
score out of the 16 hypercubes in the stack. The remaining 15 hypercubes were combined
one-by-one with the first; the one which produced the best 2-hypercube maximin score was
designated the second hypercube. The third hypercube was selected by combining each of the 14
remaining hypercubes with the first and second, and so on.
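
The ordering procedure just described is a greedy search, and can be sketched as below. The names `maximin` and `greedy_order` are assumptions, each hypercube is represented as a list of coordinate tuples, and ties are broken by whichever candidate is encountered first.

```python
import itertools
import math

def maximin(points):
    """Smallest pairwise Euclidean distance in a set of points."""
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))

def greedy_order(cubes):
    """Return hypercube indices ordered so that each added hypercube
    gives the best maximin score for the stack built so far."""
    remaining = list(range(len(cubes)))
    # Level 1: the single hypercube with the largest maximin score.
    first = max(remaining, key=lambda i: maximin(cubes[i]))
    order, stacked = [first], list(cubes[first])
    remaining.remove(first)
    # Later levels: try each remaining hypercube on top of the stack.
    while remaining:
        best = max(remaining, key=lambda i: maximin(stacked + list(cubes[i])))
        order.append(best)
        stacked.extend(cubes[best])
        remaining.remove(best)
    return order
```

For 16 hypercubes this requires 16 + 15 + ... + 1 = 136 maximin evaluations, which is far cheaper than testing every possible ordering.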

The  resulting  stacked  Latin  hypercube  was  then  compared  against  standard  Latin  hypercubes 
and nested Latin hypercubes. Each space-filling approach was assessed for various numbers of
cases, from 500 up to 8,000 cases, in steps of 500. For each number of cases, at least 500
standard Latin hypercubes were generated using Matlab’s lhsdesign function and assessed
using  the  Euclidean  maximin  distance.  The  standard  deviation  of  the  resulting  distances  was 
less  than  0.025  for  all  levels. 

The  nested  Latin  hypercube  cases  had  a  minimum  size  of  500  cases  and  grew  by  a  factor  of  2, 
so each NLHC had space-filling levels of 500, 1,000, 2,000, 4,000, & 8,000 cases. At least 500
NLHCs were generated; each NLHC was then assessed at each level of cases. If a particular
number of cases did not correspond to a space-filling set, such as 1,500 cases, no assessment was
made.  The  average  Euclidean  maximin  distance  at  each  level  was  calculated. 

Table  9  gives  the  Euclidean  maximin  distance  for  the  stacked  Latin  hypercube  at  each  level,  as 
well  as  the  average  results  for  the  standard  and  nested  Latin  hypercubes.  The  most  obvious 
aspect  of  this  table  is  the  gaps  that  result  from  the  geometric  growth  of  NLHCs,  which  are 
relatively small at first but become more significant at higher levels. Another interesting
observation is the degree of similarity between the nested & sliced Latin hypercubes. Unlike
the other methods, these two had maximin distances consistent to at least two decimal places
for every point of comparison. It is believed that these results stem from the fact that both
methods use random permutations of the smallest space-filling set to generate larger data sets.
The use of permutations, rather than the creation of independent & unique hypercubes, may
limit the potential of the designs with respect to optimal space-filling characteristics. It must be
said,  however,  that  both  methods  generate  sample  sets  quite  rapidly. 


Table 9: Minimum Spacing of Hypercubes (HCs) of Various Sizes

To determine the time required to generate a given sample design, each method was used to
generate  a  set  of  8,000  cases  over  49  dimensions;  this  was  repeated  1,000  times  for  each 
method,  with  the  exception  of  the  stacked  Latin  hypercube.  The  average  time  per  standard  Latin 
hypercube  was  8.8  seconds,  while  the  average  sliced  Latin  hypercube  time  was  a  mere  0.38 
seconds.  Nested  Latin  hypercubes  were  generated  in  an  average  of  5.3  seconds.  It  was  curious 
that  the  nested  Latin  hypercubes  could  be  generated  more  rapidly  than  standard  hypercubes 
despite  the  complexity  of  the  nesting  method.  Upon  review  of  the  function  description  for 
lhsdesign, used to create the standard hypercubes, it was found that the function may iterate up
to  five  times  to  improve  the  distribution  of  points  according  to  the  space-filling  criterion,  which 
by  default  is  the  Euclidean  maximin  distance.  This  may  also  explain  why  the  standard 
hypercubes  have  a  larger  maximin  distance  than  the  NLHC  cases,  even  at  the  smallest  pool  size. 

Lastly,  it  was  observed  that  the  stacked  Latin  hypercube  did  in  fact  offer  better  space-filling 
characteristics  as  measured  by  Euclidean  maximin  distance.  This  indicated  that  the  stacked 
Latin  hypercube  had  very  good  space-filling  characteristics  while  enabling  progressive 
sampling  with  a  linear  rate  of  data  set  growth.  The  negative  aspect  of  this  approach  was  evident 
in the amount of effort required to assemble a good stacked Latin hypercube. Although the stacked
Latin hypercube was computationally intensive to generate, the effort was rewarded with a set
of samples that allowed smooth linear growth while retaining good space-filling qualities.

This technique would form the null hypothesis, which asserted that space-filling samples would
be the most effective sample distribution method for surrogate modeling.

The cases from the stacked Latin hypercube were analyzed using Cart3D and grouped into
the appropriate data pools. Recall that roughly 70 percent of the cases that were analyzed
for the nine-dimensional problem were suitable for modeling, i.e., had converged for all 3
flight conditions. For this larger problem, roughly 90 percent of the analyses were found to be
converged for all flight conditions, which suggests one of two primary explanations: either the
Mach 2.5, α 0° flight condition accounted for the bulk of unconverged cases, since it was
replaced by the Mach 2.5, α 0 & 40° conditions for this experiment, or the default values used
to reduce the design space to 9 dimensions produced configurations that were more likely to
exhibit poor computational convergence. The various data pools were used to train Kriging
models with linear trends and anisotropic Gaussian correlation functions. These surrogates
were then evaluated using the test cases identified earlier. By comparing the predicted Cm
values against Cart3D results, the Root Mean Squared Error (RMSE) was calculated for
each model. Those RMSE values are presented in Figure 46. For the most part, the results
showed very limited improvement or mild degradation with increasing data pool size, which was
unexpected. The model for Cm at Mach 2.5, α 15° showed an 11 percent improvement in RMSE
for 4,000 cases compared to 500 cases, as seen in Figure 46c; all other Cm models became less
accurate as more cases were added, at least with respect to the test cases.

Figure 46: Prediction RMSE for Stacked Latin Hypercube Sampling


The Cm surrogate models for the Mach 0.3 flight condition were investigated to identify why
model improvement was minimal or negative. For those models, the β values for the
underlying trend model, which are similar to the coefficients in a response surface model,
showed mild convergence behavior as the available data pool grew, but no significant
changes overall. Some weights progressively approached zero, indicating that as more data became
available, the corresponding parameters were found to be less important in determining the
response, but those weights were small to begin with and thus did not strongly affect the
models. This observation indicated that the underlying trend that was fit to the training data
remained approximately constant as more training data was added.

Recall that Kriging predictions depend on two components: the underlying trend model, which
is similar to a response surface model, and the correlation-based correction term. The
correction term is used to capture divergences from the underlying trend based on the degree
to which other nearby data points deviated from that trend. When the Kriging surrogate model
is fit to the training data, the coefficients for the trend and the correction term are optimized
to best fit the observations.[108] The Kriging surrogate produced by the DACE toolbox
utility dacefit is represented by a data structure which includes the trend model coefficients
as well as various other parameters relevant to the Kriging model.

When the coefficients for the correlation term were investigated, they were found to be
unchanged from the initial guess values. This can occur when the algorithm which optimizes
the correction coefficients cannot identify any new values that would improve the model
accuracy. It was believed that this behavior was due to the relative sparsity of the overall
sampling. Even for the larger data pool sizes, the points may not have been close enough
together for the correction term to produce any consistent improvement.

This was tested by evaluating the correlation between the test cases and the training data sets.
The Kriging models in this study all used an anisotropic Gaussian correlation function, which
was previously defined in Equation 3 on page 80. The implementation of this function within the
DACE toolbox uses a slightly different notation:


$$k(u, v) = \prod_{i=1}^{E} \exp\left(-\theta_i \left|u_i - v_i\right|^2\right) \tag{40}$$


In this version, the $\theta_i$ term, which indicates the correlation coefficient for dimension i, is
multiplicative rather than divisive. $u_i$ and $v_i$ represent the ith component of points u and v,
respectively. When the Kriging models were fit to the data, the $\theta_i$ values were initialized at a
value of 10 and allowed to vary between 0.1 and 20. Smaller $\theta_i$ values would indicate that
there is correlation between cases which are farther away.
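
To give a feel for how quickly this correlation decays in 49 dimensions, the function can be evaluated directly. The sketch below is not the DACE toolbox code; the function name `gaussian_correlation` is an assumption, and the offset of one normalized unit in every dimension is an illustrative choice rather than a measured spacing.

```python
import math

def gaussian_correlation(u, v, theta):
    """Anisotropic Gaussian correlation: the product over dimensions of
    exp(-theta_i * (u_i - v_i)**2), computed as one exponential."""
    exponent = sum(t * (a - b) ** 2 for a, b, t in zip(u, v, theta))
    return math.exp(-exponent)

if __name__ == "__main__":
    dims = 49
    theta = [10.0] * dims               # the initial guess used in this study
    origin = [0.0] * dims
    neighbor = [1.0] * dims             # illustrative offset in every dimension
    print(gaussian_correlation(origin, origin, theta))    # identical points
    print(gaussian_correlation(origin, neighbor, theta))  # vanishingly small
```

Because the exponents add across dimensions, even moderate separations in each of 49 variables drive the product toward zero unless the $\theta_i$ values are small.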


It should be noted that the points u and v must be normalized before correlation can be
calculated. The DACE toolbox function dacefit transforms the input cases and responses so
that the normalized variables have a mean of 0 and a standard deviation of 1.[?] Each variable
is normalized by subtracting the mean of the known values for that variable and then dividing
by the observed standard deviation of that variable:


$$\hat{S}_{:,j} = \frac{S_{:,j} - \mu(S_{:,j})}{\sigma(S_{:,j})} \tag{41}$$

Here, $S_{:,j}$ is the set of observed values for the $j$th variable in the set of points $S$. $\mu(S_{:,j})$ and
$\sigma(S_{:,j})$ represent the mean and standard deviation of the observed values, respectively. The
response values are normalized in the same manner.
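
A minimal sketch of this column-wise normalization is given below. It is not the toolbox routine; the name `normalize_columns` is an assumption, and the use of the sample (N-1) standard deviation is assumed here to match Matlab's default `std`.

```python
import statistics

def normalize_columns(rows):
    """Scale each variable (column) to zero mean and unit standard deviation.

    Returns the scaled rows together with the column means and standard
    deviations, so that other points can be scaled the same way.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]  # sample (N-1) form, assumed
    scaled = [[(x - m) / s for x, m, s in zip(row, means, stds)]
              for row in rows]
    return scaled, means, stds

if __name__ == "__main__":
    scaled, means, stds = normalize_columns([[1, 10], [2, 20], [3, 30]])
    print(means, stds)
    print(scaled)
```

Test cases would be scaled with the means and standard deviations of the training data, not their own, so that both sets share one coordinate system before correlations are computed.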

Once  the  test  cases  had  been  normalized,  the  correlation  between  each  test  case  and  the  training 
cases could be calculated. These correlation values can range from 0 to 1, with 0 indicating
no  correlation  and  1  indicating  perfect  correlation.  For  the  500-case  data  set,  the  largest 
correlation  value  between  any  test  case  and  any  training  case  was  1.6  x  10"^^'^.  When  the 
4,000-case  set  is  used  instead,  this  correlation  increases  to  4.2  x  10"^^' .  Effectively,  the  training 
data  did  not  provide  any  knowledge  about  how  far  the  test  cases  were  likely  to  deviate  from  the 
underlying  trend.  As  a  result,  the  correlation  term  of  the  Kriging  model  went  to  zero  for  all 
test  cases,  and  the  surrogate  became  a  simple  least-squares  linear  fit. 

Based  on  these  results,  it  appears  that  the  sampling  is  too  sparse  for  the  correlation  term  to  affect 
the  Kriging  prediction.  This  reduces  the  Kriging  model  to  the  underlying  trend  model  which  is 
fit  to  the  data  using  the  generalized  least-squares  solution.  [108] 

Note  that  this  observation  did  not  invalidate  the  alternative  hypothesis  that  the  proposed 
approach  would  improve  model  accuracy.  When  a  model  is  fit  using  a  least  squares  method, 
it  attempts  to  fit  every  data  point  equally  well.  If  the  model  is  fit  to  space-filling  samples,  the 
entire  design  space  is  given  equal  weight.  Conversely,  if  regions  of  desirable  response 
behavior  can  be  identified  and  emphasized  by  adaptive  sampling,  the  resulting  models  would 
exhibit  better  accuracy  in  those  regions  because  the  over-representation  of  such  cases 
artificially  weights  those  regions  as  being  more  important.  The  evaluation  of  the  alternative 
hypothesis  therefore  went  ahead  as  planned. 

8.2.5  Alternative  Hypothesis:  Multi-Fidelity  Contour-Based  Sampling 

Opposing  the  null  hypothesis  in  this  experiment  was  the  alternative  hypothesis,  which  asserted 
that  the  multi-fidelity  contour-based  sampling  approach  would  produce  surrogate  models  with 
better  predictive  accuracy  for  the  cases  of  interest.  Once  again,  a  linear  underlying  trend 
model  and  an  anisotropic  Gaussian  correlation  function  were  used.  This  illuminated  the  effects  of 
increased  space-filling  sampling  while  ensuring  that  both  the  space-filling  and  adaptively- 
sampled  methods  began  from  an  equal  footing. 

As  previously  mentioned,  the  time  and  computational  effort  required  to  train  or  apply  a  Kriging 
model  grows  geometrically  as  the  number  of  data  points  increases.  [136]  In  the  previous 
experiment,  Kriging  models  were  trained  using  no  more  than  1,000  cases.  In  deference  to  the 
larger  data  sets  that  would  be  modeled  for  this  study,  the  number  of  cases  per  batch  was 
reduced:  whereas  for  the  nine-dimensional  study  70  cases  were  selected  per  batch,  here  only  15 
cases  would  be  selected  per  batch.  This  decision  was  intended  to  reduce  the  amount  of  time 
per  batch.  Smaller  batches  also  meant  that  the  algorithm  would  be  updated  more  often, 
potentially  leading  to  better  selection  of  samples  as  each  response  was  modeled  more 
accurately. 

To  evaluate  how  the  size  of  the  warm  start  would  affect  the  performance  of  the  proposed 
approach,  the  algorithm  was  initialized  using  various  levels  of  the  stacked  Latin  hypercube, 
which contained space-filling sets of data in multiples of 500 cases. The first 6 of these sets, up
to a maximum of 3,000 cases, were used to investigate how the integrated algorithm would be
affected by increasing the quantity of data available. It was expected that the integrated
algorithm would be progressively more effective as more space-filling data were used, due to
the greater information about response behavior.

A simple study was performed to determine how the size of the initial data pool, or “warm
start,” would affect the performance of the adaptive sampling algorithm. A larger initial data
pool might improve the surrogate model that would be used by the adaptive sampling algorithm,
leading to a better assessment of the available candidates and thus a larger improvement in
predictive accuracy. However, a larger data pool corresponds to higher surrogate model training
costs. As a result, evaluating a set of candidates would take longer. This study was intended to
determine the point at which the increased evaluation costs began to outweigh the improved
candidate selection performance.

Five different sizes of warm start were investigated. The sizes were based on the first 5 levels
of the stacked Latin hypercube, and corresponded to training data sets of around 500, 1,000,
1,500, 2,000 and 2,500 cases. Using each warm start, 30 new samples were selected (in batches
of 15). To ensure a fair comparison, all warm starts were provided with the same candidates
and test points. New candidates and test points were used for each round.

The adaptive sampling algorithm featured 3 parameters that could be adjusted by the user: the
number of candidates to be evaluated, the number of test points to be used in the evaluation,
and the required probability of interest (POI) that would be used to screen out uninteresting
candidates. Each of these parameters affected the computational effort required to select a new
sample. Identifying optimum settings for these parameters was left for future research,
especially as the “optimum” settings were likely to be problem-dependent. Instead, 3 different
sets of values were selected for use with the 49-dimensional problem. Each set of values would
be used to select 5 samples out of every batch of 15 points.

The first 5 points out of every batch were intended to be exploratory while still keeping sample
selection times low: 1,000 candidates, 1,500 test points, and a required POI of 0 percent.

Because the POI criterion was a “greater-than” and not a “greater-than-or-equal,” a POI of 0
percent would still eliminate some candidates. The number of candidates eliminated would
depend strongly on the estimated prediction uncertainty: when uncertainty was larger, more
candidates would have a POI value greater than zero.

The second 5 points were selected from a larger set of candidates (3,500), which would be
evaluated with a greater number of test points (3,500). If the POI requirement were not
adjusted, this would have led to a significant increase in analysis time. To mitigate this effect, a
higher POI requirement (1 percent) was used. Depending on the prediction uncertainty, this
POI requirement would typically disqualify 65-85 percent of the candidates for this problem,
so the time required per sample selection did not increase excessively.

The final 5 points in each batch were also selected using 3,500 candidates and 3,500 test points.
The 1 percent POI requirement of the previous 5 selections was fairly restrictive, so that
requirement was relaxed partially to 0.5 percent for these selections. This reduced POI
requirement would still disqualify 55-80 percent of the candidates, again depending on the
estimated prediction uncertainty.

Beginning with each warm-start set, these parameter schedules were used to select batches of 15
samples at a time. Those samples would be analyzed, and the new samples would be appended
to the associated warm-start set. This procedure was repeated twice for each warm start,
augmenting the initial space-filling cases with a total of 30 new samples. The sample selections
were performed in Matlab 2011b on a shared computing resource with 4 gigabytes of RAM
and two Intel Xeon E5440 2.83 gigahertz processors, each of which had 4 cores. It should be
noted that the computing resource was shared and other users may have been active during this
effort, reducing the resources that were available to it.
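
The three parameter sets used within each batch can be written down compactly as a small configuration. The sketch below only restates the values given above; the names `SelectionSettings`, `BATCH_SCHEDULE` and `passes_screen` are assumptions introduced for illustration, and the POI calculation itself is not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SelectionSettings:
    n_candidates: int      # candidates evaluated per sample selection
    n_test_points: int     # test points used in the evaluation
    poi_threshold: float   # required probability of interest, as a fraction
    n_samples: int = 5     # samples selected with these settings per batch

# One batch of 15 samples: 5 exploratory, 5 at 1% POI, 5 at 0.5% POI.
BATCH_SCHEDULE = (
    SelectionSettings(1_000, 1_500, 0.0),
    SelectionSettings(3_500, 3_500, 0.01),
    SelectionSettings(3_500, 3_500, 0.005),
)

def passes_screen(poi: float, settings: SelectionSettings) -> bool:
    """Strict 'greater-than' screen: a candidate whose POI is exactly
    zero is rejected even when the requirement is 0 percent."""
    return poi > settings.poi_threshold

if __name__ == "__main__":
    print(sum(s.n_samples for s in BATCH_SCHEDULE))   # samples per batch
    print(passes_screen(0.0, BATCH_SCHEDULE[0]))      # rejected at 0 percent
```

Writing the screen as a strict inequality reproduces the behavior noted earlier, in which a 0 percent requirement still eliminates some candidates.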

The  first  quantity  evaluated  was  the  average  time  per  sample  selection.  The  reader  may  recall 
that the computational cost to build a Kriging surrogate model grows as $O(N^3)$, where N is
the  number  of  training  data  points.  As  a  result,  larger  sets  of  warm-start  data  were  expected 
to  take  longer  to  select  each  sample.  The  time  required  to  select  a  sample  was  grouped 
based  on  the  associated  settings  for  the  algorithm  parameters  (number  of  candidate  points, 
number  of  test  points,  &  POI  requirement).  The  average  time  required  to  select  a  sample 
under  each  set  of  algorithm  parameters  was  then  calculated  for  each  warm  start  size.  The 
results  are  displayed  in  Table  10. 

Table 10: Effects of Data Pool Size on Time Per Sample Selection (in minutes)

Algorithm Settings                               500     1,000   1,500   2,000   2,500
                                                 Cases   Cases   Cases   Cases   Cases
1,000 Candidates, 1,500 Test Points, POI 0%      31      74      165     220     332
3,500 Candidates, 3,500 Test Points, POI 1%      48      103     233     265     392
3,500 Candidates, 3,500 Test Points, POI 0.5%    62      130     296     332     475

As  Table  10  shows,  the  average  time  required  to  select  each  new  sample  increased  drastically 
as  the  size  of  the  warm  start  grew.  There  was  something  of  an  aberration;  the  increases  in 
average  sample  selection  times  between  the  “1,500  Cases”  and  “2,000  Cases”  columns  are 
much  smaller  than  the  increases  between  any  other  pair  of  consecutive  columns.  It  is  possible 
that  other  users  were  accessing  the  shared  computing  resource  during  part  of  this  study.  A 
surge  in  competition  for  resources  during  the  1,500-case  portion  of  the  calculation  and  reduced 
competition  during  the  2,000-case  portion  could  explain  the  observed  results;  however,  the 
distribution  of  resources  at  any  given  time  was  not  recorded,  limiting  the  author’s  ability  to 
substantiate  this  theory.  In  general,  however,  the  observed  results  did  show  that  the  time  & 
effort required to select a new sample grew rapidly.

The other aspect that was being investigated was whether the extra information led to better
sample-selection behavior, evidenced by surrogates with improved predictive accuracy. A
better  initial  surrogate  would  evaluate  candidates  more  accurately  and  thus  might  do  a  better  job 
of  selecting  candidates,  resulting  in  more  rapid  improvement  in  surrogate  model  accuracy. 

After  2  batches  of  points  were  selected  using  each  of  the  initial  data  sets,  the  resulting  Kriging 
models  were  evaluated  using  the  test  cases  in  order  to  assess  their  predictive  accuracy.  The 
resulting  RMSE  values  are  compared  in  Table  11.  It  was  found  that,  as  with  the  stacked  Latin 
hypercube, there was not a strong or consistent relationship between the number of
space-filling cases and the predictive accuracy of the model for sets of up to 2,500 cases.

Table 11: Prediction Accuracy (RMSE) After Two Batches of Adaptive Samples

Flight Condition     500     1,000   1,500   2,000   2,500
                     Cases   Cases   Cases   Cases   Cases
Mach 0.3, α 15°      0.395   0.412   0.412   0.410   0.433
Mach 0.8, α 0°       0.793   0.886   0.754   0.717   0.675
Mach 2.5, α 15°      0.252   0.235   0.256   0.263   0.264
Mach 2.5, α 40°      0.925   0.719   0.800   0.805   0.791

More  importantly,  new  samples  could  be  selected  more  rapidly  for  the  smallest  data  set  than  for 
the  larger  sets,  which  more  than  compensated  for  the  reduction  in  available  information.  For 
example, the surrogate models based on 1,000 space-filling cases and two adaptively-selected
batches  had  better  predictive  accuracy  for  the  2  supersonic  flight  conditions  than  those  based 
on  500  cases  and  2  adaptive  batches.  However,  in  the  time  required  to  select  2  batches  of 
adaptive  cases  for  the  1,000-case  set  (roughly  38  hours  for  2  batches),  it  was  possible  to  select  4 
batches  for  the  500-case  set  (roughly  35  hours  for  4  batches);  when  the  extra  2  batches  were 
included,  models  based  on  the  500-case  set  had  better  predictive  accuracy.  In  light  of  these 
results,  the  larger  data  sets  were  abandoned  and  future  sampling  efforts  would  use  the  set  of 
500  space-filling  samples  as  the  warm  start. 

8.2.6  Probability  of  False  Positives 

Although Root Mean Squared Error, or RMSE, is a relatively easy way to quantify the predictive
accuracy of each surrogate, the result may be difficult to interpret beyond “smaller is better.” To
more clearly illustrate the accuracy of a surrogate, a new metric was developed. This metric uses
RMSE to calculate the likelihood that a poorly-performing configuration will be mistakenly
predicted to have good performance. This likelihood shall be referred to as the “probability of a
false positive” or POFP.

Root Mean Squared Error approximates the standard deviation of the observed prediction error
(exactly so when the mean error is zero).
The “true” response for a given configuration, as calculated by Cart3D, can be approximated
by a normal distribution with the predicted response value (0 in this case) as its mean and the
RMSE of the surrogate as its standard deviation. Using the mathematical technique presented
earlier, the portion of the distribution that lies outside of ±0.1 can be calculated. Specifically,
the user can calculate the likelihood that, if the surrogate predicts that some configuration has
a pitching moment coefficient of exactly 0, the true pitching moment coefficient as calculated by
Cart3D would be found to fall outside the range of -0.1 < Cm < 0.1. In other words,
POFP estimates the likelihood that the surrogate model would evaluate an uncontrollable
configuration and pronounce it controllable. It stands to reason that smaller POFP values
would correspond to better predictive accuracy.
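
The POFP calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function name `pofp` and its parameters are hypothetical, and the sketch assumes that the prediction error is normally distributed with zero mean and a standard deviation equal to the prediction RMSE, with the surrogate predicting a pitching moment coefficient of exactly 0.

```python
import math


def pofp(rmse, bound=0.1):
    """Probability that the true response lies outside +/- bound.

    Assumes the true response is normally distributed about a predicted
    value of 0 with standard deviation equal to the prediction RMSE.
    The name and signature are illustrative, not from the report.
    """
    if rmse <= 0.0:
        return 0.0  # a perfect surrogate never yields a false positive
    # For X ~ N(0, rmse^2): P(|X| > bound) = erfc(bound / (rmse * sqrt(2)))
    return math.erfc(bound / (rmse * math.sqrt(2.0)))


if __name__ == "__main__":
    for rmse in (0.05, 0.1, 0.262, 0.432, 0.612):
        print(f"RMSE = {rmse:5.3f}  ->  POFP = {100.0 * pofp(rmse):5.1f}%")
```

Under these assumptions, an RMSE of 0.262 gives a POFP of roughly 70 percent and an RMSE of 0.612 roughly 87 percent, in line with the values quoted in this section. A different constraint width can be explored by changing `bound`.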

The relationship between prediction RMSE and POFP for a constraint of |Cm| < 0.1 is shown in
Figure 47. For prediction RMSE values below 0.05, the POFP is effectively zero. There is
a large increase in POFP as RMSE grows, although the rate of increase is reduced once
RMSE exceeds roughly 0.3. Clearly the objective would be to minimize POFP, and by extension
minimize prediction RMSE. Large gains can be made once prediction RMSE falls below 0.3.

As predictive accuracy increases, the likelihood of a false positive - that a poorly-performing
configuration will be identified as having good performance - is reduced. It must be noted that
POFP will depend, not only on RMSE, but on the response range of interest, and as a result
Figure 47 is only valid for the current problem.


Figure 47: Probability of a False Positive for |Cm| < 0.1


8.2.7  Evaluation  of  Accuracy  for  Pitching  Moment 

Once the predictive accuracy was quantified for models of pitching moment coefficient, the
results for each flight condition were plotted for visual comparison. The results for Mach 0.3,
α 15° are plotted in Figure 48. The black squares mark the predictive accuracy of
single-fidelity Kriging models that were trained using the space-filling stacked Latin hypercube. Note
that predictive accuracy did not improve as more samples were included; instead, as the size of
the training data pool increased from 500 to 7,000 points, the predictive RMSE got worse, from
0.53 to 0.61.

The grey squares denote Kriging models that were trained using the same stacked Latin
hypercube cases, but which also made use of APAS data via Ghoreyshi cokriging. Like the
single-fidelity results, the accuracy of the models did not improve when more space-filling
cases were included. The predictive RMSE values for the smallest and largest training sets
were 0.42 and 0.49, respectively. On average, the incorporation of cheaper data reduced RMSE
by about 0.13, or 25 percent.

The  grey  circles  mark  the  performance  of  surrogate  models  trained  with  the  adaptively-selected 
samples; these models were single-fidelity, trained only with Cart3D results. Note that the
samples  were  selected  using  multi-fidelity  contour-based  sampling.  The  single-fidelity  models 
were  trained  solely  for  the  purpose  of  comparison.  Initial  rounds  of  sampling  did  demonstrate 
improved  accuracy,  from  0.53  to  0.46,  although  this  improvement  leveled  off  after  six  or  seven 
rounds  of  sampling. 

Lastly, the models produced by the proposed method are marked with black circles. These
models showed some variation in behavior, but overall the results were the same as the
space-filling cases: predictive accuracy degraded slightly as the training data set grew larger. Over 15
rounds of adaptive sampling, the training data set grew from 500 to 725 samples and the
prediction RMSE grew from 0.42 to 0.43. To translate this RMSE value into probability of false
positive, if the latest surrogate model predicted some untested configuration to have a pitching
moment coefficient of exactly 0 at this flight condition, there would be an 81 percent chance that
an analysis with Cart3D would reveal the actual pitching moment coefficient to lie outside ±0.1. This
is larger than would be preferred, but still an improvement over the space-filling approach:
after 7,000 samples, the stacked Latin hypercube approach achieved an RMSE of 0.612 and the
POFP would be 87 percent.

Figure 48: Predictive Accuracy for Pitching Moment at Mach 0.3, α 15°


The prediction RMSE and POFP scores for a number of surrogate models are given in Table 12
to allow quick comparisons. The proposed method had smaller RMSE and reduced chance of
a false positive. In contrast, single-fidelity surrogate models trained using stacked Latin
hypercube cases grew mildly less accurate as larger data sets were used.

These results illustrated how the POFP metric could at times behave in an unintuitive manner;
although the prediction error was reduced by 15-20 percent, the POFP was only reduced by 5-6
percent. The prediction error was large compared to the response range of interest, and so
even a relatively large reduction in error did not correspond to a large impact on POFP. For
smaller RMSE values, a similar reduction in RMSE would produce a larger reduction in POFP.
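
The weak response of POFP to RMSE at large error levels can be checked numerically. The sketch below is illustrative only: it assumes the zero-mean normal error model with a bound of ±0.1, uses the RMSE values from Table 12 for the 500-case stacked Latin hypercube and the proposed method, and then applies the same fractional RMSE cut to a made-up starting RMSE of 0.10. The helper names are hypothetical.

```python
import math


def pofp(rmse, bound=0.1):
    """POFP under a zero-mean normal error model (illustrative helper)."""
    return math.erfc(bound / (rmse * math.sqrt(2.0)))


def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# RMSE for the 500-case stacked LHC and for the proposed method (Table 12).
rmse_lhc, rmse_proposed = 0.533, 0.432
rmse_cut = relative_reduction(rmse_lhc, rmse_proposed)
pofp_cut = relative_reduction(pofp(rmse_lhc), pofp(rmse_proposed))

# Same fractional RMSE cut applied to a much smaller starting RMSE.
# The value 0.10 is invented purely for illustration.
small_before = 0.10
small_after = small_before * (1.0 - rmse_cut)
pofp_cut_small = relative_reduction(pofp(small_before), pofp(small_after))

print(f"RMSE reduction:                 {100 * rmse_cut:.1f}%")
print(f"POFP reduction at large RMSE:   {100 * pofp_cut:.1f}%")
print(f"POFP reduction at small RMSE:   {100 * pofp_cut_small:.1f}%")
```

With these inputs the relative POFP reduction at the large RMSE level is only a few percent, while the same fractional RMSE cut near 0.10 reduces POFP several times more, which is the behavior described above.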

The neural network that was trained to emulate this response during the previous RBS effort [40]
was also evaluated. This network was trained using results from a nested Latin hypercube, and
was based only on Cart3D data. It was interesting to note that the neural network produced a
prediction RMSE of 1.47 for these test cases - a value much larger than the prediction RMSE
of 0.395 when the surrogate was tested with space-filling cases. This suggested that the neural
network fit other regions of the space much more accurately than the current region of interest.

Table 12: Predictive Accuracy for Pitching Moment Coefficient at Mach 0.3, α 15°

Surrogate Model                                      Number of Converged Cases   RMSE    POFP
Stacked LHC, 500-Case Design                         458                         0.533   85%
Proposed Method, 725 Samples                         651                         0.432   81%
Stacked LHC, 1,000-Case Design                       922                         0.549   86%
Stacked LHC, 7,000-Case Design                       6,430                       0.612   87%
Neural Network with Nested Latin Hypercube Design    11,417                      1.47    95%

Overall,  the  results  were  mixed.  The  inclusion  of  cheaper  data  did  produce  a  consistent 
improvement  in  predictive  accuracy  of  roughly  25  percent.  However,  with  regard  to  sample 
selection,  neither  space-filling  nor  adaptive  samples  were  clearly  more  effective.  Both 
approaches  produced  reduced  accuracy  as  more  samples  were  added.  No  matter  which 
sampling  approach  was  used,  the  DACE  toolbox  was  unable  to  identify  correlation  weights 
which  improved  model  accuracy  over  that  of  the  underlying  trend  alone. 
Figure 49: Predictive Accuracy for Pitching Moment at Mach 0.8, α 0°


Like  Mach  0.3,  the  surrogate  model  that  stood  in  for  APAS  at  Mach  0.8  had  fairly  large 
prediction  error,  as  shown  in  Figure  49.  The  standard  deviation  of  the  Model  Representation  Error 
was  roughly  0.65,  which  was  large  with  respect  to  the  width  of  the  response  region  of  interest. 


The single-fidelity space-filling approach, using 7,000 cases, produced a prediction RMSE of
0.864 and a POFP of 91 percent. Combining space-filling samples and data fusion produced
a mild improvement, particularly for larger data sets, but it was not as effective as it was for
Mach 0.3, α 15°. Likewise, single-fidelity surrogates trained with the adaptively sampled cases
were almost indistinguishable from those trained with the adaptive samples and Ghoreyshi
cokriging. After the end of sampling, the final surrogate model created by the proposed method
had a prediction RMSE of 0.762, which corresponds to a POFP of 90 percent. Note that as the
number of samples increased, both sampling approaches became markedly less accurate. For
further comparison, the RBS neural network used 11,159 cases and produced an RMSE of
0.688 for these test cases; for space-filling test cases, the RMSE was 0.274. These values are
presented for easier comparison in Table 13.


The surrogates based on 2,000-4,000 cases, which demonstrated a sharp increase and plateau of
prediction error, were investigated to determine the source of this degradation. A few patterns
were observed: the trend coefficients for Loft Start and Loft End shifted from O(10^?) to
O(10^?), while those for Vertical Tail Maximum Thickness Location and the Inboard Starboard
Elevon Deflection decreased from O(10^?) to O(10^?) and O(10^?), respectively. It should be
noted that the magnitudes of these coefficients remained small at all times, and that
identification of these coefficients as the cause for the change in predictive accuracy was
tentative at best.
Table 13: Predictive Accuracy for Pitching Moment Coefficient at Mach 0.8, α 0°

Surrogate Model                                      Number of Converged Cases   RMSE    POFP
Stacked LHC, 500-Case Design                         458                         0.615   87%
Proposed Method                                      651                         0.724   90%
Stacked LHC, 1,000-Case Design                       922                         0.695   91%
Stacked LHC, 7,000-Case Design                       6,430                       0.864   96%
Neural Network with Nested Latin Hypercube Design    11,159                      0.688   88%

The prediction accuracy results for Mach 2.5, α 15° are given in Figure 50. Here, data fusion
made  a  very  significant  difference  with  regard  to  predictive  accuracy,  reducing  the  RMSE  value 
from  0.935  to  0.384.  Adaptive  sampling  improved  the  results  from  there,  bringing  the  RMSE 
down  to  0.262.  The  rate  of  improvement  due  to  additional  sampling  was  not  as  rapid  as  for 
Mach  0.3. 

Surrogates  trained  with  space-filling  samples  and  Ghoreyshi  cokriging  are  marked  with  grey 
squares.  It  was  clear  that  data  fusion  made  a  powerful  contribution  for  this  response, 
reducing  RMSE  by  roughly  60  percent  in  many  cases.  The  grey  circles  denote  single-fidelity 
surrogates  trained  with  adaptively-selected  samples.  It  bears  repeating  that  these  samples  were 
selected  using  multi-fidelity  contour-based  sampling;  the  single-fidelity  surrogates  were  only 
trained  for  performance  comparison.  The  resulting  surrogates  did  improve  in  accuracy  much 
more  rapidly  than  those  trained  with  space-filling  samples,  reducing  prediction  error  by  about  30 
percent. 
Figure 50: Predictive Accuracy for Pitching Moment at Mach 2.5, α 15°

The proposed approach, combining both data fusion and adaptive sampling, was the most
effective for this response. With an RMSE value of 0.262, the probability of a false positive
(POFP) for the proposed approach was 70 percent. The space-filling model based on 7,000
samples produced a prediction RMSE value of 0.810, resulting in a POFP value of 90 percent,
while the surrogate based on 7,000 space-filling samples and Ghoreyshi cokriging produced a
prediction RMSE of 0.274 (equal to a POFP of 72 percent). This was a much larger
improvement in RMSE than at Mach 0.3, although most of the improvement appears to be due
to the incorporation of APAS data. The neural network generated during the RBS project was
trained with 10,458 space-filling cases and exhibited a prediction RMSE of 0.887 when applied
to the test cases with small pitching moment coefficients. When evaluated for space-filling test
cases, the RMSE of that neural network was 0.262. These results are given in Table 14.

Table 14: Predictive Accuracy for Pitching Moment Coefficient at Mach 2.5, α 15°

Surrogate Model                                      Number of Converged Cases   RMSE    POFP
Stacked LHC, 500-Case Design                         458                         0.935   91%
Proposed Method                                      651                         0.262   70%
Stacked LHC, 1,000-Case Design                       922                         0.841   91%
Stacked LHC, 7,000-Case Design                       6,430                       0.810   90%
Neural Network with Nested Latin Hypercube Design    10,458                      0.887   91%

Approved  for  public  release;  distribution  unlimited 


The prediction accuracy results for Mach 2.5, α 40° are given in Figure 51. Data fusion was
very powerful for this flight condition as well, improving prediction RMSE from 3.15 to 1.23
for the 500-case training set as shown by the grey squares. There was still ample room
remaining for improvement. The grey circles show the performance of single-fidelity
surrogates trained using adaptively-sampled cases; improvement was much more rapid than for
space-filling cases, although these models did not attain the predictive accuracy of the
multi-fidelity surrogates. When both techniques were combined (indicated by the black circles), the
prediction RMSE of a model incorporating data fusion was improved to 0.724 after fifteen
rounds of samples.


Figure 51: Predictive Accuracy for Pitching Moment at Mach 2.5, α 40°

Table 15: Predictive Accuracy for Pitching Moment Coefficient at Mach 2.5, α 40°

Surrogate Model                                      Number of Converged Cases   RMSE    POFP
Stacked LHC, 500-Case Design                         458                         3.15    97%
Proposed Method                                      651                         0.724   89%
Stacked LHC, 1,000-Case Design                       922                         2.99    97%
Stacked LHC, 7,000-Case Design                       6,430                       3.28    96%
Neural Network with Nested Latin Hypercube Design    10,991                      1.98    91%

The final prediction RMSE for the proposed method using 710 cases was 0.724, which
corresponded to a POFP of 89 percent. Using 7,000 cases, the space-filling approach
produced a prediction RMSE of 3.28, which was equivalent to a POFP of 98 percent. Similar
to the Mach 2.5, α 15° flight condition, the incorporation of multi-fidelity modeling produced a
substantial improvement in predictive accuracy. Note that for space-filling sampling, the
predictive accuracy stagnated quickly and did not improve as more samples were added. In
contrast, surrogates based on the adaptive sampling approach using fewer than 800 samples
out-performed surrogates based on nearly ten times as many space-filling samples.

For  comparison,  the  neural  network  for  this  flight  condition  from  the  RBS  project  used  10,991 
cases  and,  when  applied  to  the  present  set  of  test  cases  with  small  pitching  moment 
coefficients,  produced  an  RMSE  value  of  1.98;  its  RMSE  for  space-filling  samples  was  0.456. 
These  results  are  organized  for  easier  comparison  in  Table  15. 

The  very  poor  performance  for  the  method  at  the  Mach  0.8  flight  condition  was  investigated 
to  determine  the  likely  cause  of  the  error.  Given  the  relatively  poor  accuracy  of  the  surrogate 
model  for  APAS  at  that  flight  condition  (the  prediction  error  had  a  standard  deviation  of 
around  0.65),  it  seemed  possible  that  the  poor  behavior  stemmed  from  bad  low-fidelity 
estimates.  To  test  this  possibility,  multi-fidelity  surrogates  were  trained  using  data  directly 
from  APAS  rather  than  from  surrogates  of  APAS  data.  Multi-fidelity  surrogates  were  trained 
based on the 500-case level of the stacked Latin hypercube. One set of surrogates was trained
using surrogate model predictions of APAS results as the low-fidelity data source, while the
other set was trained using actual APAS results as the low-fidelity data source. Both sets of
models were then tested to assess predictive accuracy. The results of these tests are presented
in Table 16. When APAS was used directly, there were slight improvements at Mach 0.3, α 15°
and Mach 2.5, α 15°, as well as a more substantial improvement at Mach 2.5, α 40°.
Unfortunately,  there  was  no  improvement  at  the  Mach  0.8  flight  condition,  which  was  the 
motivation  for  this  test. 

It would appear that a linear model, even when augmented by the low-fidelity response value,
was a poor match for the behavior of the pitching moment coefficient at Mach 0.8, α 0°.
Based  on  the  example  of  the  two-dimensional  Sphere  function,  it  is  expected  that  if  enough 
samples  were  available  the  algorithm  would  eventually  achieve  an  accurate  understanding  of 
the  response.  However,  there  is  no  way  of  knowing  whether  that  accuracy  would  be  obtained 
after  10  samples  or  10,000. 
Table 16: Evaluating Effects of Low-Fidelity Surrogates On Overall Accuracy

Flight Condition    Prediction RMSE Using Surrogates   Prediction RMSE Using APAS Directly
Mach 0.3, α 15°     0.535                              0.525
Mach 0.8, α 0°      0.356                              0.394
Mach 2.5, α 15°     Q211                               0.245
Mach 2.5, α 40°     0.979                              0.794

Note that the pitching moment coefficient at Mach 0.8, α 0° was predicted well by a
multi-fidelity model with a linear underlying trend, as shown earlier in this chapter. In that
demonstration,  the  dacefit  utility  was  able  to  identify  correlation  parameters  which  improved 
the  prediction  accuracy  of  the  model.  When  those  parameters  can  be  identified,  Kriging  can 
better  estimate  how  the  response  diverges  from  the  linear  underlying  trend.  Although  dacefit  was 
unsuccessful  at  identifying  useful  correlation  parameters  in  this  enlarged  problem,  it  was 
expected  that  once  such  parameters  could  be  identified,  the  predictive  accuracy  of  the  Kriging 
model  would  increase  substantially  -  at  least  in  the  neighborhood  of  the  observed  samples. 

The relative gains of the proposed method against space-filling sampling - using either
single-fidelity or multi-fidelity surrogates - are presented in Table 17. For most of the
responses, multi-fidelity modeling produced the lion’s share of the
accuracy improvement. This was unsurprising given that the adaptive sampling algorithm
depends on a reasonably accurate understanding of the behavior of each response; samples are
selected based on this understanding, so if the understanding is poor, the samples may not be very
helpful. Adaptive sampling still produced 5-20 percent improvement over the use of
multi-fidelity modeling alone.
Table 17: Overall Comparison of Predictive Accuracy for Longitudinal Responses

Flight Condition    Error Reduction Vs.             Error Reduction Vs.            Reduction In No.
                    Space-Filling Single-Fidelity   Space-Filling Multi-Fidelity   of Analyses
Mach 0.3, α 15°     29%                             11%                            90%
Mach 0.8, α 0°      12%                             4.4%                           90%
Mach 2.5, α 15°     68%                             4.4%                           90%
Mach 2.5, α 40°     78%                             20%                            90%
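
The single-fidelity error reductions and the reduction in analyses in Table 17 can be reproduced from RMSE values quoted earlier in this section. The sketch below is a worked check, not the author's procedure: it assumes that the first column compares the proposed method against the 7,000-case single-fidelity space-filling models, which is an inference from the quoted values, and the function and variable names are illustrative.

```python
def percent_reduction(baseline, improved):
    """Percentage reduction going from `baseline` to `improved`."""
    return 100.0 * (baseline - improved) / baseline


# Prediction RMSE pairs quoted in the text for each flight condition:
# (7,000-case space-filling single-fidelity model, proposed method).
rmse = {
    "Mach 0.3, alpha 15 deg": (0.612, 0.432),
    "Mach 0.8, alpha 0 deg": (0.864, 0.762),
    "Mach 2.5, alpha 15 deg": (0.810, 0.262),
    "Mach 2.5, alpha 40 deg": (3.28, 0.724),
}

error_reduction = {
    condition: percent_reduction(baseline, proposed)
    for condition, (baseline, proposed) in rmse.items()
}

# High-fidelity analyses per flight condition: 7,000 space-filling cases
# versus 725 cases for the proposed method.
analysis_reduction = percent_reduction(7000, 725)

for condition, value in error_reduction.items():
    print(f"{condition}: {value:.0f}% error reduction")
print(f"Reduction in analyses: {analysis_reduction:.1f}%")
```

Rounded to whole percentages, these inputs give 29, 12, 68, and 78 percent, matching the first column of the table, and a reduction in analyses of just under 90 percent.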

Overall,  although  the  gains  were  substantial,  more  progress  would  be  necessary  before  such 
surrogate  models  could  be  used  for  design  space  exploration  with  confidence;  none  of  the 
surrogates,  whether  based  on  the  null  or  alternative  hypotheses,  achieved  a  prediction  RMSE 
score  below  0.1,  which  would  be  equivalent  to  a  POFP  less  than  33  percent. 

For most responses, the proposed approach was moderately effective, producing significant
reductions in prediction RMSE. The proposed approach was also much more efficient with
respect to high-fidelity analyses: the improved predictive accuracy was obtained with a
reduction in the number of expensive analyses by almost 90 percent compared to the
space-filling approach (725 analyses per flight condition compared to 7,000). Despite these
achievements, further improvements would still be required before surrogate models such as
these could be used for engineering purposes.

8.2.8  Evaluation  of  Accuracy  for  Lateral  Moments 

Although  the  RBS  project  found  that  pitching  moments  were  difficult  to  model  accurately, 
there  were  also  substantial  difficulties  in  modeling  the  lateral  responses.  Because  the  four  flight 
conditions  evaluated  for  this  effort  only  included  symmetric  conditions,  all  of  the  lateral 
responses  were  likely  to  be  near  zero.  However,  asymmetric  deflections  of  control  surfaces 
would  still  produce  nonzero  lateral  responses  that  should  be  modeled  as  accurately  as  possible. 

Recall  that  the  APAS  geometry  definition  did  not  capture  control  surface  deflections,  and  thus 
the  low-fidelity  model  could  not  capture  any  asymmetric  effects.  Any  lateral  responses 
calculated  by  APAS  would  therefore  be  known  in  advance  to  be  spurious.  Although  it  was 
shown that multi-fidelity models with nuggets could overcome unhelpful low-fidelity data, it
was  decided  that  APAS  results  would  not  be  used  in  these  tests.  Without  low-fidelity  surrogate 
models,  the  only  remaining  source  of  uncertainty  in  the  data  was  the  iteration  noise  produced 
by  Cart3D.  This  uncertainty  was  tracked  and  incorporated  via  nuggets  when  Kriging  models 
were  trained. 
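
The effect of a nugget can be illustrated with a very small one-dimensional Kriging predictor. The sketch below is a generic textbook-style example under stated assumptions, not the DACE toolbox implementation used in this effort: it assumes a zero-mean (simple) Kriging model, a Gaussian correlation function with a fixed parameter `theta`, and a nugget expressed as noise variance relative to process variance. All names and data values are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def krige(xs, ys, nuggets, x_new, theta=1.0):
    """Zero-mean Kriging prediction at x_new with per-sample nuggets."""
    n = len(xs)
    corr = [[math.exp(-theta * (xs[i] - xs[j]) ** 2) for j in range(n)]
            for i in range(n)]
    for i in range(n):
        # Nugget: noise variance of sample i relative to process variance.
        corr[i][i] += nuggets[i]
    weights = solve(corr, ys)
    r = [math.exp(-theta * (x_new - xi) ** 2) for xi in xs]
    return sum(ri * wi for ri, wi in zip(r, weights))


# Made-up samples; the second observation is treated as noisy.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.05, -0.02, 0.03]
exact = krige(xs, ys, [0.0, 0.0, 0.0, 0.0], 1.0)
smoothed = krige(xs, ys, [0.0, 0.5, 0.0, 0.0], 1.0)
print(f"no nugget:   {exact:.4f}")
print(f"with nugget: {smoothed:.4f}")
```

Without nuggets the predictor reproduces the observed value at a sampled point; with a nugget on that sample, the prediction is pulled away from the noisy observation toward what the neighboring samples and the trend suggest. Per-sample nuggets allow samples with more iteration noise to be trusted less than well-converged ones.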
It was previously shown in this effort that capturing uncertainty via nuggets was most
effective when the uncertainty in the response value was correlated with the magnitude of that
response. The correlation values for the lateral responses for the 49-dimension problem were
calculated to determine how effective nuggets were likely to be at reducing prediction error for
those responses. The results for rolling moments are presented in Table 18, and the results for
yawing moments are presented in Table 19.
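
A correlation of this kind can be computed with the standard Pearson coefficient. The sketch below is illustrative only: it assumes that each case provides a mean response value (mu) and a standard deviation of its iteration noise (sigma), and the sample data are invented, not taken from the Cart3D results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented per-case values: mean response (mu) and the standard
# deviation of its iteration noise (sigma).
mu = [0.010, -0.004, 0.022, 0.015, -0.018, 0.003]
sigma = [0.0021, 0.0019, 0.0020, 0.0023, 0.0018, 0.0022]

r = pearson(mu, sigma)
print(f"correlation between mu and sigma: {r:+.3f}")
```

A coefficient near zero suggests that the noise level is unrelated to the response value, in which case nuggets would be expected to offer little benefit.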


Table 18: Correlation Between Uncertainty & Response Magnitude: Rolling Moments

                                            Mach 0.3,   Mach 0.8,   Mach 2.5,   Mach 2.5,
                                            α 15°       α 0°        α 15°       α 40°
Correlation Between Mean μ and Standard
Deviation of Uncertainty σ                  -0.021      0.16        0.047       0.036

Table 19: Correlation Between Uncertainty & Response Magnitude: Yawing Moments

                                            Mach 0.3,   Mach 0.8,   Mach 2.5,   Mach 2.5,
                                            α 15°       α 0°        α 15°       α 40°
Correlation Between Mean μ and Standard
Deviation of Uncertainty σ                  -0.097      0.23        0.12        -0.012

For  the  most  part,  the  correlation  between  response  magnitude  and  uncertainty  was  small.  The 
exception to this was at Mach 0.8, α 0°, where the correlation coefficient was 0.16 for the
rolling  moment  coefficient  and  0.23  for  the  yawing  moment  coefficient.  These  correlation 
values  were  still  relatively  small  but  indicated  that  some  moderate  degree  of  noise  was 
present  in  the  data.  Incorporating  this  noise  using  nuggets  could  therefore  improve  the 
accuracy  of  surrogate  models  for  those  two  responses.  In  general,  though,  these  results 
indicated  that  the  bulk  of  the  observed  response  behavior  was  due  to  actual  flow  phenomena 
and  not  spurious. 

To  determine  whether  the  observed  response  behavior  was  due  to  legitimate  aerodynamic  effects 
or  numerical  noise,  additional  tests  were  performed  to  determine  the  correlation  between 
rolling  and  yawing  moments  and  the  design  variables.  It  was  found  that  the  lateral  responses 
were  strongly  correlated  with  the  control  surface  deflections,  as  shown  in  Table  20.  Such 
deflections  would  in  fact  produce  nonzero  rolling  and  yawing  moments,  suggesting  that  the 
observed  behavior  was  driven  by  legitimate  effects  rather  than  by  noise  in  the  data. 
Table 20: Correlation Between Lateral Responses & Control Surface Deflections

Response                   Outboard Elevon   Outboard Elevon   Rudder        Rudder
                           (Starboard)       (Port)            (Starboard)   (Port)
Mach 0.3, α 15°  Croll     -0.86             0.78              -0.37         -0.27
Mach 0.8, α 0°   Croll     -0.88             0.73              -0.38         -0.23
Mach 2.5, α 15°  Croll     -0.78             0.83              -0.31         -0.36
Mach 2.5, α 40°  Croll     -0.85             0.77              -0.26         -0.36
Mach 0.3, α 15°  Cyaw      0.51              -0.75             0.52          0.15
Mach 0.8, α 0°   Cyaw      0.54              -0.76             0.53          0.14
Mach 2.5, α 15°  Cyaw      0.50              -0.85             0.33          0.35
Mach 2.5, α 40°  Cyaw      0.66              -0.87             0.28          0.41

Two  sets  of  models  were  trained;  the  first  set  embodied  the  null  hypothesis  and  was  made  up  of 
the  first  4,000  points  of  the  stacked  Latin  hypercube.  These  models  were  fit  with  no  nuggets, 
treating  all  observed  response  values  as  perfectly  deterministic.  The  second  set  embodied  the 
alternative  hypothesis  and  consisted  of  the  first  500-point  space-filling  level  of  the  stacked  Latin 
hypercube  plus  9  batches  of  adaptive  samples  which  had  been  selected  to  improve  the  pitching 
moment  surrogates.  These  surrogates  were  trained  using  nuggets  based  on  the  observed  iterative 
noise in each response. Both sets of models were fit using only Cart3D results.


These  models  were  not  evaluated  using  POFP,  the  probability  of  false  positives,  because  unlike 
pitching moment coefficient, there was no rule of thumb available to set bounds on the lateral
control  authority  of  a  reusable  booster  vehicle. 


The prediction accuracy for rolling moment at Mach 0.3, α 15° is shown in Figure 52. Figure
52a shows the full range of results, indicating that the space-filling sampling produced
diminishing returns after perhaps 2,000 samples. A cropped view is shown in Figure 52b to
more clearly illustrate the behavior of the models based on adaptive sampling. After the most
recent set of adaptive cases, the proposed approach with nuggets produced a prediction RMSE
of 0.0711 using 725 cases. The space-filling cases produced prediction RMSE values of 0.112
and 0.636 based on 500 and 1,000 cases, respectively.
Figure 52: Predictive Accuracy for Rolling Moment at Mach 0.3, α 15°

These  results  indicated  that  for  this  response,  incorporating  noise  did  not  substantially  reduce 
prediction  error.  This  agreed  with  the  observation  that  iteration  noise  had  minimal  correlation 
with  the  response  magnitude.  For  this  response,  the  proposed  method  was  approximately 
equivalent  to  the  baseline  approach. 

The results for rolling moment coefficient prediction accuracy at Mach 0.8, α 0° appear in
Figures 53a and 53b. At this flight condition, the space-filling approach started out with a
prediction RMSE of 0.0839 based on 500 cases but improved to 0.0591 for 1,000 space-filling
cases. For larger numbers of cases, the average prediction RMSE was 0.0588.

The  surrogate  models  based  on  the  proposed  approach  demonstrated  rapid  initial  improvement, 
similar  to  the  noiseless  surrogates,  and  appeared  to  show  more  rapid  improvement  than  the 
baseline  as  the  training  data  pool  grew  larger.  After  fifteen  rounds  of  adaptive  sampling  (725 
samples  total),  the  prediction  RMSE  was  0.0605.  The  prediction  RMSE  after  fourteen  rounds 
of  sampling  (710  cases)  was  0.0547,  better  than  any  noiseless  model  that  was  trained  with  fewer 
than  6,500  cases. 


Figure 53: Predictive Accuracy for Rolling Moment at Mach 0.8, α 0°

In general, there was mild evidence that the proposed method might have produced
more-accurate surrogates if sampling had continued; based only on the available evidence the two
approaches are effectively neck-and-neck. These results were consistent with the calculated
correlation between the iteration noise for this response and the response magnitude (0.16),
which would suggest mild improvements at best.

Figure 54 shows the results for Mach 2.5, α 15°. As was observed for Mach 0.8, the space-filling
approach  did  not  obviously  improve  for  sampling  sizes  above  1,000  cases.  The  average 
prediction  RMSE  for  those  models  was  0.0177,  while  the  RMSE  for  the  model  based  on  500 
cases  was  0.0279. 


Figure 54: Predictive Accuracy for Rolling Moment at Mach 2.5, α 15° (panels a and b; horizontal axis: Cases Included)

Once  again,  the  proposed  approach  with  nuggets  appeared  to  improve  prediction  accuracy 
slightly more efficiently than the space-filling approach, but the surrogate models based on
the proposed approach were never clearly superior. The prediction RMSE values after the
fourteenth and fifteenth rounds of adaptive sampling were 0.0187 and 0.0240, respectively.
The correlation coefficient between iteration noise and response magnitude for this response
was 0.047, which suggested that capturing iteration noise would not produce much improvement,
a deduction consistent with the numerical results.

Lastly, the prediction performance for rolling moment coefficient at Mach 2.5, α 40° may be
seen in Figure 55a & 55b. As with the other flight conditions, models based on the
proposed approach appeared to reduce RMSE using fewer cases than the space-filling
approach, but fifteen batches were not sufficient to demonstrate this one way or the other. The
average RMSE for models based on at least 1,000 space-filling cases was 0.0256. The RMSE
values after the fourteenth and fifteenth batches of adaptive samples were 0.0321 and 0.0418,
respectively.

Figure 55: Predictive Accuracy for Rolling Moment at Mach 2.5, α 40° (horizontal axis: Cases Included)


With  respect  to  rolling  moment  coefficient,  the  proposed  approach  did  not  produce  surrogate 
models  that  were  significantly  more  accurate  than  the  baseline  approach.  This  was  not 
entirely  surprising,  as  the  correlation  calculations  in  Table  18  indicated  that  iteration  noise  was 
not strongly affecting those responses. The strongest correlation between iteration noise and a
lateral  response,  0.23,  was  for  the  yawing  moment  coefficient  at  Mach  0.8,  as  noted  in  Table 
19.  The  next  section  will  quantify  the  effects  of  nuggets  when  predicting  yawing  moment 
coefficients. 


Figure 56 illustrates how prediction RMSE for the yawing moment at Mach 0.3, α 15° varied
with sampling and modeling approach. The space-filling approach seemed to converge to an
RMSE value of around 0.024 after roughly 1,500 cases. The models using nuggets did
demonstrate slightly improved performance - RMSE for the 500-case set when nuggets were used
was 0.0300, versus 0.0353 without nuggets - but the results did not demonstrate consistent
improvement  as  more  samples  were  added.  After  the  fifteen  sets  of  samples,  the  prediction 
RMSE  for  models  with  nuggets  was  0.0307,  slightly  worse  than  the  surrogate  trained  with  500 
samples. 


Figure 56: Predictive Accuracy for Yawing Moment at Mach 0.3, α 15°

The results for yawing moment at Mach 0.8, α 0° are shown in Figure 57. For noiseless models
trained with the space-filling samples, there was a clear improvement as more samples were used:
the model based on the set of 500 samples had a prediction RMSE of 0.0847, while the one
based on 2,000 samples had a prediction RMSE of 0.065 and that based on 4,000 samples
produced a value of 0.0464. This improvement leveled off when more than 4,000 samples were
available.


Figure 57: Predictive Accuracy for Yawing Moment at Mach 0.8, α 0° (panels a and b; horizontal axis: Cases Included)

For this response, there was a clear benefit to the use of nuggets. Incorporating uncertainty due
to  iteration  for  the  500-case  sample  set  reduced  prediction  RMSE  from  0.0847  to  0.0554. 
Models  based  on  later  data  sets  produced  prediction  RMSE  values  between  0.0481  and  0.0661. 
Even  the  least-accurate  surrogate  that  used  nuggets  was  more  accurate  than  every  noiseless 
surrogate  trained  on  fewer  than  2,000  samples. 

The  use  of  nuggets  allowed  a  surrogate  trained  with  725  samples  to  out-perform  a  noiseless 
surrogate  trained  with  more  than  twice  as  much  data.  The  correlation  between  response 
magnitude  and  iteration  noise  was  0.23,  which  had  led  to  the  expectation  that  surrogates  using 
nuggets  would  be  more  accurate  for  this  response.  These  observations  supported  that  conclusion. 

All surrogate models for yawing moment coefficient at Mach 2.5, α 15° had very similar
performance.  Surrogates  without  nuggets  based  on  space-filling  samples  demonstrated  slight 
improvement  as  more  cases  were  added,  starting  at  a  prediction  RMSE  of  0.0255  and  averaging 
an  RMSE  of  0.0191  when  more  than  1,500  samples  were  incorporated.  Surrogates  which 
incorporated  nuggets  did  not  perform  any  better:  the  prediction  RMSE  values  ranged  from 
0.0235  to  0.0276.  The  correlation  coefficient  between  response  magnitude  and  iteration  noise 
for  this  response  was  0.12,  indicating  that  iteration  noise  was  unlikely  to  be  a  major  factor  in 
prediction error. These results are displayed in Figure 58.

Figure 58: Predictive Accuracy for Yawing Moment at Mach 2.5, α 15°


Finally, the results for yawing moment predictions at Mach 2.5, α 40° are depicted in Figure 59.
These results were very similar to those for Mach 2.5, α 15°. Noiseless surrogates based on
space-filling samples showed mild improvement as more samples were added, eventually
settling down to an average RMSE of 0.0205 when more than 1,500 cases were available.


Figure 59: Predictive Accuracy for Yawing Moment at Mach 2.5, α 40° (horizontal axis: Cases Included)

The proposed method produced a slight improvement for the initial sample set of 500 cases,
reducing prediction RMSE from 0.0226 to 0.0209, but subsequent sets of data produced
worse prediction RMSE values than were produced by the noiseless models. The correlation
coefficient between response magnitude and iteration noise was 0.012, which had indicated that
nuggets would not substantially improve prediction error for this response.


In general, the proposed method - and in particular the use of nuggets to capture data
uncertainty - demonstrated only minor improvements at best compared to noiseless models
for most flight conditions. However, at Mach 0.8 a significant benefit was observed for the
yawing moment coefficient, and noiseless surrogates required 4-5 times as many samples to
equal the performance of the surrogates which captured uncertainty via nuggets. It was found
that the impact of nuggets closely matched the degree of correlation between the response
magnitude and the iteration noise, which indicates that the user may only wish to use nuggets
when that correlation is relatively large (>0.2). The causes for the observed behaviors will be
discussed in the next section.
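
The screening rule suggested above (use nuggets only when the correlation between response magnitude and iteration noise exceeds roughly 0.2) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the Pearson coefficient is assumed as the correlation measure, the absolute value of the response is assumed to stand in for "response magnitude", and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def nuggets_recommended(responses, noise_std, threshold=0.2):
    """Return True when |response| and iteration-noise std. dev. correlate above threshold.

    Assumption: magnitude is taken as the absolute value of each response.
    """
    magnitudes = [abs(r) for r in responses]
    return pearson(magnitudes, noise_std) > threshold
```

The default threshold of 0.2 mirrors the value quoted in the text; it is an empirical observation from these experiments rather than a general rule.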

8.2.9  Interpretation  of  Longitudinal  &  Lateral  Results 

With  regard  to  pitching  moment,  the  proposed  method  was  quite  effective  for  most  flight 
conditions. After 15 batches of adaptive samples, the prediction RMSE at Mach 0.3 was 30
percent smaller than that of the single-fidelity surrogate trained with 7,000 space-filling samples. At
Mach 0.8, the surrogate based on 15 batches of adaptive samples was 12 percent more accurate
than the baseline approach. For the two Mach 2.5 flight conditions (α 15° & 40°), RMSE was
reduced by 67 percent and 78 percent respectively. Note that these improvements in accuracy
were achieved while reducing the number of training samples by nearly 90 percent compared to
the single-fidelity, space-filling approach.

The pitching moment coefficient at Mach 0.8, α 0° proved to be the most difficult to predict
accurately.  At  this  flight  condition,  the  use  of  multiple  sources  of  data  provided  marginal 
improvements  at  best,  and  it  appeared  that  the  response  behavior  could  not  be  easily  modeled 
with a linear underlying trend. As more training data was used, the model’s performance
became progressively worse. It was expected that this trend would reverse itself eventually, as
was  observed  with  the  two-dimensional  Sphere  Function  example,  but  it  was  not  known  when 
that  reversal  might  occur. 

The  proposed  method  demonstrated  no  significant  improvements  when  used  to  predict  rolling 
moment  coefficients.  When  applied  to  yawing  moment  coefficients,  the  proposed  method 
produced clear improvements for Mach 0.8 but was approximately equivalent to the baseline
approach  at  other  flight  conditions.  The  proposed  method  primarily  distinguished  itself  for 
lateral  responses  through  its  use  of  nuggets.  The  other  specialized  aspects  of  the  proposed 
method  -  use  of  lower-fidelity  data  and  an  iterative  sampling  strategy  -  were  not  relevant 
because  the  cheaper  APAS  data  did  not  capture  the  phenomena  which  drove  the  lateral 
responses  observed  in  Cart3D,  and  the  iterative  strategy  focused  purely  on  the  pitching  moment 
coefficients.  This  left  the  use  of  nuggets  as  the  distinguishing  feature  between  the  standard 
method  and  the  proposed  method. 

It was previously shown that the use of nuggets produced significant improvements in fit
accuracy  when  the  response  was  correlated  with  the  amount  of  noise  in  that  response  -  in 
essence,  when  any  large  values  that  were  observed  were  more  likely  to  be  spurious  than 
representative  of  actual  response  behavior.  Observations  from  this  demonstration  mirrored 
those  results.  The  lateral  response  with  the  strongest  correlation  between  the  response 
magnitude and iteration noise was the yawing moment coefficient at Mach 0.8, α 0°, and it was
this  response  that  showed  the  largest  improvement  in  predictive  accuracy  when  the  proposed 
method  was  applied. 
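
The effect of a nugget on a Kriging predictor can be illustrated with a small one-dimensional example. The sketch below is deliberately simplified and is not the implementation used in this work: it assumes a zero-mean (simple) Kriging predictor instead of the linear-trend models described here, a fixed correlation parameter `theta`, and per-sample nugget values added to the diagonal of the correlation matrix. All function names are hypothetical.

```python
import math


def gauss_corr(a, b, theta):
    """One-dimensional Gaussian correlation between inputs a and b."""
    return math.exp(-theta * (a - b) ** 2)


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def krige(xs, ys, nuggets, theta, x_new):
    """Zero-mean Kriging prediction at x_new; nuggets[i] is added to the i-th diagonal term."""
    n = len(xs)
    corr = [[gauss_corr(xs[i], xs[j], theta) + (nuggets[i] if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    weights = solve(corr, ys)
    return sum(weights[i] * gauss_corr(x_new, xs[i], theta) for i in range(n))


if __name__ == "__main__":
    xs, ys = [0.0, 0.5, 1.0], [0.0, 1.0, 0.0]
    # Without a nugget the predictor interpolates the (possibly spurious) spike at 0.5.
    print(krige(xs, ys, [0.0, 0.0, 0.0], 10.0, 0.5))
    # A nugget on the middle sample pulls the prediction back toward its neighbors.
    print(krige(xs, ys, [0.0, 1.0, 0.0], 10.0, 0.5))
```

With no nugget the surrogate passes exactly through every training value, so a large spurious observation is reproduced in full. A nugget on that observation reduces its influence, which is consistent with the benefit being largest when large response values tend to be the noisy ones.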

8.2.10  Shortcomings  of  Full-Scale  Test 

The  biggest  shortcoming  of  the  full-scale  test  was  the  inability  of  the  Kriging  surrogate  models  - 
all the Kriging surrogate models - to identify useful correlation coefficients. Lacking such
coefficients, the Kriging surrogates would only deviate from the underlying linear trend models
in the very close vicinity of training points. Given that the response behaviors were unlikely to
be linear over all 49 input parameters, this led to relatively inaccurate surrogates.

During  the  literature  search,  it  was  noted  that  a  sparse  correlation  matrix  would  lead  to  a  heavy 
dependence  on  the  underlying  trend  model.  That  generalization  was  made  in  the  context  of 
applying  sparse  methods  to  the  problem  in  the  event  that  the  correlation  matrix  became  so  large 
and  ungainly  that  it  approached  the  memory  limits  of  Matlab,  the  program  being  used  to 
create  the  Kriging  surrogates.  Although  no  sparse  methods  were  applied  to  the  problem  at  hand, 
the  same  effect  -  surrogate  models  which  depended  strongly  on  the  underlying  trend,  with 
correlation  parameters  affecting  the  predictions  only  rarely  -  was  observed  in  the  results.  This 
indicated that sparseness effects might have played some role in the difficulties that were
experienced  in  the  full-scale  test.  To  test  this  possibility,  another  study  was  designed  to 
determine  whether  the  difficulties  were  related  to  sample  sparseness. 

8.2.11  Investigation  of  Sparsity  Effects 

Investigating  this  possibility  for  the  full-scale  problem  was  daunting:  the  most  direct 
approach  would  have  been  to  increase  the  sample  density,  and  by  extension  increase  the  number 
of  samples.  This  was  a  difficult  proposition,  since  training  sets  of  up  to  7,000  cases  had 
difficulty  identifying  useful  correlation  coefficients;  training  sets  with  more  than  7,000  cases 
produced  “Out  of  Memory”  errors  from  Matlab.  Instead,  based  on  the  recommendations  of  the 
research  committee,  the  smaller  9-dimensional  problem  was  used.  Instead  of  progressively 
increasing  the  sample  density  for  the  larger  problem,  the  sample  density  would  be 
progressively  decreased  for  the  smaller  problem.  As  surrogate  models  were  trained  with 
smaller  and  smaller  data  sets,  it  was  expected  that  they  would  eventually  behave  in  the  same 
manner  as  the  surrogates  for  the  large-scale  test. 

A stacked Latin hypercube was generated using eighty 25-case Latin hypercubes for a total of
2,000  cases.  When  every  dimension  was  normalized  to  a  zero-to-one  range,  the  maximin 
distance  between  two  points  was  0.272.  For  comparison,  fifty  2,000-case  standard  Latin 
hypercubes  were  generated;  the  best-spaced  hypercube  had  a  maximin  distance  of  0.233, 
indicating  that  the  stacked  Latin  hypercube  cases  were  well  spread  out  throughout  the  design 
space. 
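
The comparison above rests on two operations: generating a Latin hypercube on a normalized design space and measuring the smallest distance between any two of its points. The sketch below illustrates both in Python under simplifying assumptions. It picks the best of several random hypercubes and does not reproduce the stacking procedure or the optimizer used in this work; all function names are hypothetical.

```python
import math
import random


def latin_hypercube(n_cases, n_dims, rng):
    """Random Latin hypercube on [0, 1]^n_dims: one point per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_cases))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_cases for s in strata])
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_cases)]


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points (the quantity a maximin design maximizes)."""
    best = math.inf
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            best = min(best, math.dist(points[i], points[j]))
    return best


def best_of(n_candidates, n_cases, n_dims, seed=0):
    """Generate several random hypercubes and keep the one with the largest minimum distance."""
    rng = random.Random(seed)
    designs = [latin_hypercube(n_cases, n_dims, rng) for _ in range(n_candidates)]
    return max(designs, key=min_pairwise_distance)


if __name__ == "__main__":
    design = best_of(n_candidates=50, n_cases=25, n_dims=9)
    print(min_pairwise_distance(design))
```

The pairwise loop is quadratic in the number of cases, so a 2,000-case design as used here would require about two million distance evaluations per candidate.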

Surface  meshes  for  the  cases  in  the  stacked  Latin  hypercube  were  generated  with  the  PaceLab 
geometry tool and analyzed with Cart3D at the three relevant flight conditions: Mach 0.3, α
15°; Mach 0.8, α 0°; and Mach 2.5, α 0°. Single-fidelity surrogate models were trained to
emulate  the  pitching  moment  coefficient  at  every  flight  condition  using  progressively  larger 
levels  of  the  stacked  Latin  hypercube  (SLHC),  starting  with  the  smallest  set  of  25  space-filling 
points.  All  Kriging  models  were  trained  using  an  anisotropic  Gaussian  correlation  function  and 
a linear underlying trend. This was different from the previous surrogate models made of these
responses,  which  used  quadratic  underlying  trends.  The  change  was  made  because  a  lower- 
order  trend  would  require  fewer  samples  -  a  9-dimensional  quadratic  trend  would  require  91 
samples  to  fit  a  model,  while  a  linear  trend  would  only  require  10  -  and  the  objective  was  to 
investigate  effects  related  to  sparsity. 

The  predictive  accuracy  of  the  resulting  surrogate  models  was  then  evaluated  using  the  test 
cases for this problem. The results are plotted in Figure 60. Figure 60a shows the results when
predicting Cm at Mach 0.3, α 15°; Figure 60b shows the results for Mach 0.8, α 0°; and Figure
60c shows the results for Mach 2.5, α 0°. Note that for all three responses, the predictive
accuracy  improves  rapidly  until  100-125  cases  are  included,  and  then  the  improvement  is  more 
sedate.  This  change  in  behavior  indicates  that  a  different  phenomenon  is  at  play  for  the  smaller 
training  sets. 


Figure  60:  Change  in  Predictive  Accuracy  as  Training  Set  Grows 


To  further  investigate  this  behavior,  the  correlation  between  training  points  and  test  points  was 
calculated  for  each  training  set.  The  correlation  was  calculated  with  the  same  anisotropic 
Gaussian  correlation  function  used  in  the  Kriging  models,  while  the  correlation  coefficients 
were  estimated  when  the  Kriging  surrogate  model  was  trained.  The  maximum  correlation 
between any one training case and any one test case is plotted in Figure 61. Figure 61a shows
this  maximum  correlation  at  the  Mach  0.3  flight  condition.  Figure  61b  shows  the  maximum 
correlation  at  the  Mach  0.8  flight  condition,  and  Figure  61c  shows  the  maximum  correlation  at 
the  Mach  2.5  flight  condition. 
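
The maximum training-to-test correlation described above can be sketched as follows. The anisotropic Gaussian correlation is written in its usual form, exp(-Σ_k θ_k (x_k - x'_k)²), with one correlation coefficient θ_k per input dimension. The function names are hypothetical, and the θ values in the example are illustrative rather than fitted values from this study.

```python
import math


def gaussian_correlation(x, x_other, thetas):
    """Anisotropic Gaussian correlation: exp(-sum_k theta_k * (x_k - x'_k)^2)."""
    return math.exp(-sum(t * (a - b) ** 2 for t, a, b in zip(thetas, x, x_other)))


def max_train_test_correlation(train, test, thetas):
    """Largest correlation between any one training point and any one test point."""
    return max(gaussian_correlation(tr, te, thetas) for tr in train for te in test)


if __name__ == "__main__":
    # Illustrative 9-dimensional case: distant points with large thetas are nearly uncorrelated.
    train = [[0.0] * 9]
    test = [[1.0] * 9]
    print(max_train_test_correlation(train, test, [5.0] * 9))
```

When this maximum is close to zero, the correlation term contributes almost nothing at the test points and the prediction falls back on the underlying trend.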


The  largest  correlations  between  training  and  test  points  for  the  surrogates  trained  on  the  four 
smallest  data  sets  were  less  than  1“^  for  all  responses.  This  is  similar  to  the  behavior  observed 
in  the  large-scale  test  problem  where  there  was  minimal  correlation  between  training  and  test 
data,  resulting  in  a  total  dependence  on  the  underlying  trend  when  predicting  responses  for  the 
test  data  points.  When  the  correlation  coefficients  for  these  Kriging  models  were  investigated,  the 
surrogates  trained  on  the  smallest  data  sets  were  unable  to  identify  any  coefficient  values  that 
would  improve  surrogate  model  accuracy,  which  was  also  observed  in  the  large-scale  test 
problem.  These  lines  of  evidence  indicated  that  the  phenomenon  which  caused  such  trouble  in 
the  large-scale  test  problem  was  present  in  this  investigation,  affecting  the  surrogate  models 
trained on the smallest data sets.

Figure 61: Maximum Correlation Between Any Training Case & Any Test Case


The DACE toolbox was first able to identify useful correlation coefficients for the 125-case level
of the SLHC. For this data set, 125 cases were analyzed at each flight condition. After stripping
out cases which did not converge at all flight conditions, 74 cases were left in the data set. This
would  have  been  too  few  to  fit  a  quadratic  underlying  trend,  supporting  the  decision  to  use  a 
linear  trend  for  this  investigation. 


It  was  hypothesized  that  the  DACE  toolbox  was  having  difficulty  identifying  useful  correlation 
coefficients  because  the  training  data  points  were  too  far  apart.  If  this  were  the  case,  the 
model-fitting  utility  would  not  find  any  useful  difference  between  different  coefficient  values 
because  no  two  training  points  would  be  close  enough  to  each  other  to  have  a  non-negligible 
correlation  over  the  ranges  being  evaluated.  To  test  this  hypothesis,  a  number  of  new  data 
points were analyzed at various distances from one of the training points in the smallest, 25-case
level of the SLHC. This “seed point” was selected from the available converged data set.

8.2.12  Generating  Nearby  Samples 

New  data  points  were  generated  using  a  user-specified  distance-limiting  parameter  and  a  set 
of  random  numbers  to  perturb  the  new  point  away  from  the  initial  point.  Each  new  point  was 
perturbed  in  every  dimension  from  the  initial  point.  To  perturb  the  new  point,  the  user  would  set  a 
value for the distance-limiting parameter, between 0 and 1, which indicated how far the new
point  was  allowed  to  move  from  the  initial  point  in  each  dimension.  Two  random  numbers  on 
the  range  0-1  were  then  drawn  for  each  of  the  9  dimensions.  If  the  first  random  number  in 
each  pair  was  greater  than  0.5,  the  new  point  would  have  a  higher  value  in  that  dimension  than 
the  initial  point;  otherwise,  it  would  have  a  lower  value.  The  second  random  number  was  used 
to  determine  the  magnitude  of  the  perturbation  in  that  dimension. 


$$\delta_i = R_2 \times \mathrm{Range}_i \times \mathrm{Limit} \qquad (42)$$

Here, $\delta_i$ is the perturbation that is applied in the i-th dimension; $R_2$ is the second random
number that was generated; $\mathrm{Range}_i$ is the size of the design space in the i-th dimension (i.e., if
the i-th parameter ranged from 5 to 25, $\mathrm{Range}_i$ would be 20); and Limit is the user-specified
distance limit. A small value for Limit would produce new samples that were very close to the
initial point, while a larger Limit value would allow the new point to move farther away. This
perturbation  process  was  repeated  for  every  dimension  to  make  each  new  point.  If  the  new 
point  lay  beyond  the  edge  of  the  design  space  in  any  dimension,  it  was  moved  to  the  edge  of 
the  design  space  in  that  dimension. 
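
The perturbation procedure described in this section can be sketched as follows. This is a minimal Python illustration of the steps in the text (two random numbers per dimension, Equation 42 for the magnitude, and clamping at the edge of the design space), not the code used in this work; the function and argument names are hypothetical.

```python
import random


def perturb(seed_point, lower, upper, limit, rng):
    """Generate one new point near seed_point.

    seed_point, lower, upper: per-dimension values and design-space bounds.
    limit: distance-limiting parameter between 0 and 1.
    rng: a random.Random instance supplying uniform numbers on [0, 1).
    """
    new_point = []
    for x, lo, hi in zip(seed_point, lower, upper):
        r1 = rng.random()  # first random number: direction of the perturbation
        r2 = rng.random()  # second random number: magnitude of the perturbation
        delta = r2 * (hi - lo) * limit  # Equation 42
        value = x + delta if r1 > 0.5 else x - delta
        # Points beyond the edge of the design space are moved back to the edge.
        new_point.append(min(max(value, lo), hi))
    return new_point


if __name__ == "__main__":
    rng = random.Random(42)
    seed = [0.5] * 9
    print(perturb(seed, [0.0] * 9, [1.0] * 9, limit=0.1, rng=rng))
```

Because the limit is applied per dimension, the Euclidean distance between the seed point and the new point can be as large as Limit times the square root of the number of dimensions on a normalized design space.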

8.2.13  Evaluating  the  Use  of  Nearby  Samples 

Twelve groups, each with 20 new points, were generated and analyzed with Cart3D. The
results were combined with the results for the first level of the SLHC, and new Kriging
surrogates were trained. Each of the new samples was added to the training set separately,
and the combination of the original training set and this new sample was used to train a new
surrogate  model.  These  surrogates  were  then  inspected  to  determine  whether  the  DACE 
toolbox  had  been  able  to  identify  useful  correlation  coefficients. 

It  was  observed  that,  when  Limit  values  were  small,  adding  even  a  single  nearby  sample  to  the 
training set led to Kriging surrogates with non-trivial correlation coefficients. For larger values
of Limit, it became less likely that one new sample would be so useful. For each Limit value,
the fraction of samples that led to useful correlation coefficients was calculated; the results are
shown in Figure 62.


Figure 62: Fraction of Samples That Led To Useful Correlation Coefficients (horizontal axis: Disturbance Size, % of Each Dimension)

For Limit values larger than 0.2, it became progressively less likely that the new sample would
lead to useful correlation coefficients, indicating that the closeness of the training data - or at
least, the closeness of two or more points within the training data - was of great importance
when the DACE toolbox was attempting to fit a Kriging model. It should also be noted that, by
deliberately  placing  a  single  new  sample  close  to  an  existing  sample,  useful  correlation 
coefficients  could  be  identified  using  only  the  25-case  level  of  the  SLHC.  In  contrast,  when 
space-filling  samples  were  added  instead  (i.e.,  using  larger  levels  of  the  SLHC),  no  useful 
coefficients  were  identified  until  the  125-case  level. 

This  suggested  that,  although  spreading  samples  out  through  the  design  space  is  helpful  for 
developing  an  understanding  of  the  overall  response  behavior,  some  amount  of  clustering  is 
desirable  when  training  Kriging  models.  If  the  samples  are  spread  out  far  from  each  other  -  as 
was the case in these experiments, where an optimizer was used to maximize the distance
between any two points - the DACE toolbox may not be able to fit a useful model to the data. It
would appear that stacked Latin hypercubes are therefore inappropriate for use with Kriging
models,  particularly  for  large  design  spaces  or  when  only  a  limited  number  of  samples  will  be 
used. 

8.3  Review  &  Summary 

In  general  the  proposed  method  was  successful:  through  use  of  data  fusion  and  adaptive 
sampling,  surrogate  models  for  pitching  moments  were  more  accurate  even  when  trained  with 
fewer  cases,  a  result  that  was  very  clear  in  the  9-dimensional  test  problem.  For  lateral  responses, 
the  use  of  nuggets  improved  the  accuracy  of  surrogates  when  iteration  noise  was  an  important 
factor. 

Some  shortcomings  were  observed,  however.  Neither  the  proposed  method  nor  the  baseline 
approach  performed  well  when  applied  to  the  full-scale  test  problem.  The  surrogate  models 
for pitching moment coefficient (Cm) at Mach 0.8 grew increasingly and rapidly inaccurate as
more  samples  were  added  to  the  training  data,  indicating  that  the  effectiveness  of  the  proposed 
approach  is  dependent  on  the  difficulty  of  modeling  the  response.  When  the  estimated  behavior 
of  the  response  is  inaccurate,  the  samples  that  are  selected  may  not  improve  the  accuracy  of  the 
model.  Evidence  indicated  that  the  sheer  amount  of  distance  between  samples  was  a  major 
reason  why  surrogate  models  failed  to  identify  useful  correlation  coefficients;  for  future  efforts 
using  Kriging,  less  effort  should  be  spent  maximizing  the  space-filling  characteristics  of  the 
sampling  plan.  Instead,  some  mild  clustering  was  shown  to  be  beneficial  in  cases  of  sparse 
data  sets,  as  it  led  to  improved  estimation  of  correlation  coefficients.  If  the  initial  set  of 
samples  is  too  sparse,  the  adaptive  sampling  algorithm  can  be  directed  to  cluster  samples  by 
setting  the  POI  requirement  very  high  (as  in  Section  6.9),  although  the  effectiveness  of  this 
strategy  will  depend  on  how  close  the  candidate  samples  are  to  existing  training  samples. 

Additionally,  nuggets  were  introduced  as  a  response  to  the  relatively  large  iteration  noise  that 
was  observed  in  the  data  during  the  RBS  project.  For  responses  where  the  response  magnitude 
was  strongly  correlated  with  the  standard  deviation  of  the  iteration  noise,  the  use  of  nuggets  was 
shown to improve predictive accuracy. However, when the response magnitude was not well-correlated
with the amount of uncertainty in the data, no significant improvement was observed.

The  resulting  surrogate  models,  although  often  improved  with  respect  to  the  baseline  approach, 
were  not  sufficiently  accurate  for  engineering  purposes.  However,  sufficient  evidence  was 
gathered  to  assess  the  effectiveness  of  the  proposed  approach.  The  effectiveness  of  this 
approach  was  found  to  increase  under  certain  conditions: 

•  When  the  responses  being  modeled  could  be  approximated  effectively  with  simple 
surrogates  despite  a  limited  quantity  of  data; 

•  When  the  cheaper  data  source  provided  useful  insight  into  the  behavior  of  the 
response  as  calculated  by  the  high-fidelity,  expensive  data  source;  and 

•  When  uncertainty  was  correlated  with  response  magnitude. 

For  problems  which  did  not  meet  one  or  more  of  these  conditions,  the  effectiveness  of  the 
proposed  approach  was  reduced  or  negated.  In  addition,  when  samples  were  too  spread  out 
from  one  another,  sparsity  effects  could  prevent  the  identification  of  useful  correlation 
parameters for Kriging models. This limited the ability of the Kriging surrogate models to fit
the responses well, which in turn handicapped the adaptive sampling algorithm and the
overall performance of the approach. These sparsity effects could be mitigated by mild
clustering, which could be achieved in the initial set of samples or by setting high POI
requirements for some of the adaptive samples. Investigation into the degree of clustering
required for good performance was left for future work.

9 Summary, Contributions & Conclusions

This dissertation attempted to identify, evaluate and demonstrate a new sampling and
surrogate modeling procedure to help users to create accurate surrogate models of expensive,
high-fidelity analysis tools at a more reasonable cost than was previously possible. Such
surrogate models would allow users to have greater confidence in the decisions made during a
design  effort,  reducing  the  risk  that  inadequate  modeling  fidelity  would  lead  to  dead  ends  and 
backtracking.  Previous  efforts  demonstrated  that,  although  valuable,  such  surrogate  models 
could  be  exceedingly  expensive  to  train  to  a  useful  level  of  accuracy. 

The  proposed  approach  therefore  emphasized  efficiency,  using  an  adaptive  sampling  algorithm 
to  identify  the  experiments  that  would  best  improve  the  surrogates,  in  order  to  minimize  the 
number  of  analyses  required  while  maximizing  the  improvement  produced  by  each  analysis. 

In  addition,  cheaper  sources  of  data  were  leveraged  when  possible,  and  uncertainty  in  data  values 
was  quantified  to  reduce  the  likelihood  of  over-fitting  a  noisy  response. 

At  the  end  of  Section  5,  this  research  effort  was  framed  in  the  context  of  a  sampling  and 
modeling  methodology.  The  methodology  included  multiple  steps,  from  the  initial  exploratory 
analyses  to  the  evaluation  of  surrogate  models.  In  particular,  the  research  questions  and 
hypotheses  that  drove  this  research  were  formulated  to  identify  the  most  effective  ways  to  carry 
out  each  step  of  the  methodology.  Note  that  the  methodology  was  deliberately  designed  to  be 
agnostic  with  respect  to  the  analysis  tools  used.  The  following  section  will  review  the  research 
questions  and  the  hypotheses  that  were  formulated  to  address  these  research  questions.  After 
that  review,  it  will  show  how  the  experimental  results  addressed  the  hypotheses  and  the  research 
questions  in  turn. 

9.1  Review  of  Research  Questions  &  Hypotheses 

The  primary  research  question  that  drove  this  effort  was: 

How  can  high-fidelity  modeling  be  feasibly  applied  earlier  in  the  design  process,  despite  the 
computational  expense? 

Based  on  observations  made  from  the  first  attempts  to  create  aerodynamic  surrogate  models 
of  reusable  boosters  (Section  4),  three  factors  (adaptive  sampling,  multi-fidelity  modeling, 
and  capturing  uncertainty)  were  identified  that  were  expected  to  be  a  more  effective  way  to 
approach  the  problem.  These  factors  led  to  focused  research  questions,  which  scoped  the 
problem  and  drove  the  literature  search  (Section  5).  The  literature  search,  in  turn,  identified 
methods  which  could  address  those  focused  research  questions;  these  expectations  were 
expressed  in  the  form  of  hypotheses,  and  experiments  were  designed  to  test  those  hypotheses. 

9.1.1  First  Focused  Research  Question  &  Hypothesis 

The first observation was that the pitching moment coefficient for many configurations was so
extreme for at least one flight condition that those configurations were unlikely to be feasible designs. If
samples  could  be  placed  to  emphasize  feasible  configurations,  surrogate  models  could  be 
created  that  were  accurate  for  the  feasible  regions  of  the  design  space  while  minimizing  the 
number  of  infeasible  configurations  analyzed.  The  fact  that  the  objective  was  a  particular  range 
of  each  response,  rather  than  a  maximum  or  minimum,  meant  that  common  sample  selection 
approaches were not appropriate. This led to the first focused research question:

When “good performance” refers to responses within desirable ranges rather than maxima
or  minima,  how  can  regions  of  good  performance  be  identified  and  emphasized  during  the 
sampling  process? 

A review of available sample selection methods led to the identification of contour-based
sampling as a promising approach. This formed the first contributing hypothesis, Hypothesis 1:

Contour-based  sampling  will  balance  the  selection  of  cases  with  good  performance  and  the  reduction  of 
prediction  uncertainty  in  promising  regions,  identifying  samples  that  efficiently  improve  surrogate  accuracy 
for  configurations  with  small  aerodynamic  moments. 

Section 6 detailed the implementation of the method and the experiments that tested the
hypothesis. A number of test problems were used, each of which had one or more responses for
which only a particular range of response values was of interest. Surrogate models were then
trained with data sets generated with both the proposed approach (using contour-based sampling
to augment a set of initial space-filling samples) and the baseline approach (using only
space-filling samples).

In most cases, contour-based sampling led to surrogate models that were more accurate than
those based on space-filling samples. The exception to this rule was when the underlying trend
of the Kriging surrogate model was a poor representation of the response behavior, as with the
Sphere Function. In that case, contour-based sampling had relatively poor performance due to a
mismatch between the perceived response behavior (in the form of the surrogate model used
by the sample-selection algorithm) and the actual response behavior. In general, however, the
tests demonstrated that contour-based sampling did lead to surrogate models that were more
accurate, even when based on smaller training sets.

Section 6.10 demonstrated contour-based sampling for multiple flight conditions. Surrogate
model accuracy was evaluated using test cases that had small aerodynamic moments at each
flight condition. Creating a training data set with contour-based sampling led to surrogate
models that were more accurate than if only space-filling samples were used, even if a larger
number of space-filling samples was available. Thus, contour-based sampling was shown to
“efficiently improve surrogate accuracy for configurations with small aerodynamic moments,”
supporting this hypothesis.

Contour-based sampling effectively identified regions of good performance (i.e., regions with
response values within a specified range) and placed samples in those regions. This behavior
produced surrogate models that were more accurate over that specified response range. In fact,
contour-based sampling was so effective that in most cases, surrogate model prediction
accuracy could be improved while reducing the number of cases used to train the surrogates. In
light of those results, the research question was considered to have been addressed satisfactorily.
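
To illustrate how a range-based notion of good performance can be scored, the following Python sketch computes the probability that a response lies within a desired range. It assumes that the surrogate's prediction at a candidate point is Gaussian with mean mu and standard deviation sigma. The function names, the Gaussian assumption, and the example numbers are illustrative assumptions; they are not taken from the implementation described in Section 6.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_in_range(mu, sigma, lower, upper):
    """Probability that a Gaussian prediction N(mu, sigma^2) lies in [lower, upper]."""
    if sigma <= 0.0:
        # Degenerate case: no predictive uncertainty, so the answer is 0 or 1.
        return 1.0 if lower <= mu <= upper else 0.0
    return normal_cdf((upper - mu) / sigma) - normal_cdf((lower - mu) / sigma)


# Hypothetical candidates: (predicted mean, predicted standard deviation).
candidates = [(0.00, 0.02), (0.08, 0.02), (0.08, 0.10)]
for mu, sigma in candidates:
    p = probability_in_range(mu, sigma, lower=-0.05, upper=0.05)
    print(f"mu={mu:+.2f} sigma={sigma:.2f} P(in range)={p:.3f}")
```

Under these assumptions, a candidate whose predicted mean lies just outside the range scores higher when its prediction is uncertain than when it is confident. This is one way that a range-based criterion can reward exploration as well as exploitation.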

9.1.2  Second  Focused  Research  Question  &  Hypothesis 

The second observation was that, although simpler analysis methods such as APAS were not
adequate as the sole source of data, such methods could still shed light on the
overall trends in response behavior. If so, this could substantially reduce the number of
expensive analyses necessary to train accurate surrogate models. This led to the second focused
research question:

How  can  cheaper  analyses  be  integrated  with  high-fidelity  models  to  reduce  the  overall  cost  of 
design  space  exploration  or  exploitation? 

The  process  of  combining  information  from  multiple  data  sources  is  known  as  data  fusion  or,  in 
situations  where  some  data  sources  are  more  accurate  than  others,  multi-fidelity  modeling. 
Many alternative methods for data fusion exist; it can be difficult to know which one is
best suited for the problem at hand. The second focused hypothesis, therefore, did not identify one
data  fusion  method  in  particular: 

Data  fusion  techniques  will  allow  results  from  high-fidelity  analyses  to  be  augmented  with 
cheaper  sources  of  data  to  produce  surrogate  models  that  are  more  accurate  yet  require  less 
computationally-expensive  data. 

This  hypothesis  was  tested  in  a  previous  section.  Four  data  fusion  methods  -  additive 
correction,  proportional  correction,  Ghoreyshi  cokriging,  and  data  harmonization  -  were 
implemented  and  applied  to  various  test  problems  that  were  similar  to  the  intended  application 
of  reusable  booster  aerodynamics.  Based  on  those  tests,  one  of  those  techniques  (Ghoreyshi 
cokriging)  was  selected  since  it  produced  the  most  accurate  surrogate  models  for  the  test 
problems. 
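
The two simplest of these methods, additive and proportional correction, can be sketched in a few lines of Python. The sketch below is a deliberately reduced illustration: it assumes a single input, uses constant correction terms (the mean difference and the mean ratio at the shared points), and uses two hypothetical toy functions in place of real analysis tools. In practice the correction would usually be modeled as a function of the inputs. Ghoreyshi cokriging and data harmonization are not shown.

```python
def low_fidelity(x):
    """Hypothetical cheap model: captures the trend but is biased."""
    return 2.0 * x + 1.0


def high_fidelity(x):
    """Hypothetical expensive model, evaluated at only a few points."""
    return 2.0 * x + 1.5


def fit_additive(xs):
    """Constant additive correction: mean of (high - low) at the shared points."""
    return sum(high_fidelity(x) - low_fidelity(x) for x in xs) / len(xs)


def fit_proportional(xs):
    """Constant proportional correction: mean of (high / low) at the shared points.
    Assumes the low-fidelity response is nonzero at every shared point."""
    return sum(high_fidelity(x) / low_fidelity(x) for x in xs) / len(xs)


expensive_points = [0.0, 0.5, 1.0]   # the only places the expensive model is run
delta = fit_additive(expensive_points)
rho = fit_proportional(expensive_points)


def predict_additive(x):
    return low_fidelity(x) + delta


def predict_proportional(x):
    return rho * low_fidelity(x)


x_new = 0.25
print("truth       :", high_fidelity(x_new))
print("additive    :", predict_additive(x_new))
print("proportional:", predict_proportional(x_new))
```

In this toy case the discrepancy between the two models is a constant offset, so the additive correction recovers the expensive response while the proportional correction does not. With a different discrepancy the ranking could reverse, which is consistent with the observation that the best data fusion method depends on the problem.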

Here,  the  experiments  served  two  purposes:  they  compared  data  fusion  techniques  against  each 
other  to  identify  the  most  effective  approach,  and  they  compared  those  techniques  against  the 
standard  single-fidelity  approach  to  confirm  that  data  fusion  would  produce  better  surrogates. 
To  be  precise,  the  hypothesis  was  tested  by  the  latter  comparison;  the  evaluation  of  Ghoreyshi 
cokriging  against  other  data  fusion  techniques  served  simply  to  determine  which  technique  was 
the  most  promising. 

Surrogates created using Ghoreyshi cokriging were more accurate than the standard
single-fidelity surrogates for all test problems. Those results supported the hypothesis that data
fusion would lead to surrogates that were “more accurate” yet required “less
computationally-expensive data.” In particular, surrogates created with Ghoreyshi cokriging were more accurate
than  single-fidelity  surrogates  that  had  been  trained  with  much  more  data  from  the  expensive  data 
source.  These  results  were  strong  enough  that  the  second  focused  hypothesis  was  considered  to 
be  supported.  The  results  also  demonstrated  that  data  from  multiple  sources  were  being 
blended  together  to  produce  surrogate  models  that  were  more  accurate  while  requiring  fewer 
expensive  analyses,  allowing  design  space  exploration  and  exploitation  to  be  conducted  at 
reduced  cost.  Thus,  the  second  focused  research  question  had  been  addressed  satisfactorily. 

9.1.3  Third  Focused  Research  Question  &  Hypothesis 

The  third  observation  was  that  predictive  accuracy  for  lateral  responses  was  very  poor,  and  this 
poor  accuracy  was  due  in  part  to  the  relatively  noisy  behavior  of  the  responses.  By  ignoring  that 
noise  and  treating  the  responses  as  deterministic,  the  surrogate  models  were  actually  less 
accurate  than  if  the  surrogate  were  replaced  with  the  response  mean.  Identifying  which  data 
points  were  accurate  and  which  were  spurious  might  lead  to  a  more  accurate  surrogate. 
However, most surrogate modeling techniques could not take such information into account,
leading to the third focused research question:

How  can  information  about  uncertainty  in  the  data  be  captured  effectively? 

Although  most  engineering  applications  of  Kriging  assume  that  the  training  data  is 
deterministic,  an  alternative  formulation  using  nuggets  was  identified  that  could  account  for 
uncertainty  in  the  data.  Critically,  this  approach  allowed  the  user  to  specify  a  different 
uncertainty  magnitude  for  each  data  point.  This  led  to  the  third  hypothesis: 

When  creating  a  Kriging  model,  the  use  of  nuggets  will  capture  uncertainty  in  the  data, 
improving  predictive  accuracy  for  noisy  responses. 

This  hypothesis  was  also  tested  in  a  previous  section.  The  experiment  demonstrated  that 
predictive accuracy for yawing moment coefficient could be substantially improved when
sources  of  uncertainty  (especially  iteration  noise)  were  captured  via  nuggets  when  training  the 
Kriging  model.  This  effect  was  significant  for  noisy  data  but  was  negligible  when  there  was  not 
much  uncertainty  present  in  the  data. 

The  results  showed  that  the  use  of  nuggets  did  in  fact  improve  predictive  accuracy  for  noisy 
responses,  giving  support  to  the  hypothesis.  This  indicated  that  nuggets  serve  to  answer  the 
focused  research  question,  allowing  the  user  to  capture  uncertainty  in  the  response  in  an 
effective  manner,  such  that  the  resulting  surrogate  models  would  be  more  accurate  than  the 
deterministic  surrogates  that  are  common  in  engineering. 
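
The effect of a per-point nugget can be illustrated with a minimal one-dimensional Kriging predictor in Python. This is a sketch under stated assumptions and not the DACE-based implementation used in this work: the trend is a constant set to the sample mean, the correlation is Gaussian with a fixed parameter theta, and each nugget is expressed relative to the process variance. All names and data values are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def corr(x1, x2, theta):
    """Gaussian correlation between two scalar inputs."""
    return math.exp(-theta * (x1 - x2) ** 2)


def krige(xs, ys, nuggets, theta=10.0):
    """Return a predictor. nuggets[i] is added to the i-th diagonal entry of the
    correlation matrix, so each point can carry its own uncertainty."""
    n = len(xs)
    mean = sum(ys) / n  # assumption: constant trend set to the sample mean
    r = [[corr(xs[i], xs[j], theta) + (nuggets[i] if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    weights = solve(r, [y - mean for y in ys])

    def predict(x):
        return mean + sum(w * corr(x, xi, theta) for w, xi in zip(weights, xs))

    return predict


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.0, 0.25, 0.9, 0.75, 1.0]   # the value at x = 0.5 is treated as noisy
exact = krige(xs, ys, nuggets=[0.0] * 5)
noisy = krige(xs, ys, nuggets=[0.0, 0.0, 1.0, 0.0, 0.0])
print("no nugget  :", round(exact(0.5), 3))
print("with nugget:", round(noisy(0.5), 3))
```

Without nuggets the predictor reproduces every training value, including the suspect one. Assigning a nugget only to the suspect point lets the prediction there move toward the trend implied by its neighbors, while the remaining points are still interpolated.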

9.1.4  Primary  Research  Question  &  Final  Hypothesis 

Each of the three research questions was formulated to address a factor that made it difficult
to  make  surrogate  models  of  expensive  high-fidelity  data  sources,  and  thus  each  question 
addressed  an  aspect  of  the  primary  research  question: 

How  can  high-fidelity  modeling  be  feasibly  applied  earlier  in  the  design  process,  despite  the 
computational  expense? 

Each supporting hypothesis highlighted a technique to address this primary research question:
contour-based  sampling  optimized  the  selection  of  samples;  data  fusion  incorporated  cheaper 
data  sources;  and  nuggets  captured  noise  or  uncertainty  in  the  observed  data.  It  was  asserted 
that,  by  combining  these  techniques  into  a  coherent  approach,  the  primary  research  question 
might  be  answered.  This  assertion  was  itself  the  final  hypothesis  of  the  research  effort: 

By  placing  samples  intelligently,  reducing  dependence  on  the  expensive  models,  and 
accounting  for  any  uncertainty  in  the  data,  the  selected  methods  will  enable  improved 
surrogate  model  accuracy  with  significantly  reduced  data  requirements,  such  that  high- 
fidelity  modeling  becomes  a  feasible  option  earlier  in  the  design  process. 

This hypothesis was tested by applying the combined techniques to a series of representative
problems of increasing complexity. These tests were described in depth in Section 8. In the first
test, the combined techniques were applied to a problem with 3 responses (all of which were used
for adaptive sampling) and 9 free parameters. This test focused on predicting pitching moment
coefficient, which exhibited relatively low uncertainty; as a result, only data fusion and adaptive
sampling were used. The combined techniques were shown to be quite effective at improving
predictive accuracy for this problem, supporting the hypothesis.

As a final test, the combined techniques were applied to a problem with 12 responses (4 of
which were used for adaptive sampling) and 49 free parameters. This test would quantify how
the combined techniques performed when applied to the motivating problem of reusable booster
design. The test showed that all Kriging models - whether produced by the baseline approach
or by the proposed approach - had difficulty fitting the responses. Evidence indicated that this
was due to the large distances between the samples. When surrogates were fit to the lateral
responses, the proposed approach out-performed the baseline approach for responses where
noise was significant, but had roughly equivalent performance where noise did not
significantly affect the response value. With regard to the longitudinal responses, data fusion
provided the bulk of the observed improvements, which was unsurprising given that the
effectiveness of contour-based sampling was shown to depend on the accuracy of the available
surrogate models when evaluating new samples.

The large-scale problem gave weak evidence to support the final hypothesis: although the
combined techniques out-performed the baseline approach, contour-based sampling was not
particularly effective since it was difficult for the algorithm to assess candidates with any
accuracy. However, the smaller-scale problem did demonstrate that when the Kriging models
were moderately accurate, the combined techniques could produce significant improvements in
predictive accuracy. The combined techniques were shown to improve predictive accuracy
while reducing dependence on the expensive data source, supporting the final hypothesis and in
turn addressing the primary research question: the selected techniques reduced the
computational expense of high-fidelity modeling by a substantial margin, enabling design space
exploration or optimization at a more reasonable cost than was possible before.

The specific steps of the proposed method will be reviewed in the next section.

9.2  Review  of  Steps  in  the  Method 

Section 5.9.1 described the generic approach to creating surrogate models. This section will
review those steps in light of the experimental results, clearly identifying
the way that certain steps have been updated based on observations made during this
research. The updated steps are illustrated in flowchart form in Figure 63.

Step  1:  Generate  an  initial  set  of  samples  to  be  analyzed. 

The use of data fusion will affect the way that the initial samples should be selected, as well as
the way that the surrogate models are trained. Sample distributions such as nested Latin
hypercubes,[152] sliced Latin hypercubes,[153] and stacked Latin hypercubes (Section 8.2.4)
allow the user to generate sample distributions that can fill multiple roles simultaneously: the
overall set of samples is large and space-filling, suitable for the data source that has low
per-analysis costs, while subsets are identified that are much smaller yet still retain good
space-filling characteristics, suitable for the higher-fidelity data source with its greater per-analysis
costs.
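
The idea of a large space-filling set that contains a smaller, well-spread subset can be sketched in Python as follows. This is a simplified stand-in, not the nested, sliced, or stacked Latin hypercube constructions cited above: it generates an ordinary random Latin hypercube and then picks a subset with a greedy maximin heuristic, so the subset is well separated but is not itself guaranteed to be a Latin hypercube. The function names, sample counts, and seed are assumptions chosen for illustration.

```python
import random


def latin_hypercube(n, d, rng):
    """n points in [0, 1)^d with exactly one point per bin in every dimension."""
    columns = []
    for _ in range(d):
        bins = list(range(n))
        rng.shuffle(bins)
        columns.append([(b + rng.random()) / n for b in bins])
    return [tuple(col[i] for col in columns) for i in range(n)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def maximin_subset(points, k):
    """Greedily pick k points, each time adding the point farthest from those
    already chosen (a simple space-filling heuristic)."""
    chosen = [0]  # arbitrary starting point
    while len(chosen) < k:
        best = max((i for i in range(len(points)) if i not in chosen),
                   key=lambda i: min(sq_dist(points[i], points[j]) for j in chosen))
        chosen.append(best)
    return [points[i] for i in chosen]


rng = random.Random(0)
cheap_samples = latin_hypercube(n=40, d=3, rng=rng)        # low-fidelity runs
expensive_samples = maximin_subset(cheap_samples, k=8)      # high-fidelity runs
print(len(cheap_samples), len(expensive_samples))
```

Because every expensive sample is also a cheap sample, both data sources are available at the shared points, which is what correction-based data fusion methods require.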
Step 2: Analyze the samples using the appropriate data sources.

Step 3: Train Kriging surrogate models using the resulting data. Nuggets can be used to
capture the relevant uncertainties in the data, while Ghoreyshi cokriging will produce
surrogates which are more accurate but require less investment in training data.

It should be noted that the effectiveness of a data fusion technique will depend on the problem
being addressed. For the applications described in this work, Ghoreyshi cokriging produced
the largest improvement in predictive accuracy, but this may not be the case for every
application.

Figure 63: Updated Methodology for Sample Selection & Surrogate Model Creation


Step  4:  Evaluate  the  resulting  surrogate  models  to  quantify  the  predictive  accuracy. 

In these experiments, a separate set of test samples was used to evaluate the predictive
accuracy. Cross validation was not used because data was (relatively) cheap, allowing some of
it to be used purely for test purposes, and because of the particular characteristics of the
problems being addressed: for some responses, predictive accuracy was only important for cases
with response values within a certain range. Had cross validation been used with the data sets
that were available, only a handful of points would have fallen within that range, leading to
large uncertainty in the error estimates.
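
A range-restricted accuracy measure of the kind described here can be sketched in Python. The example below computes the root-mean-square error over only those test cases whose actual response falls within the range of interest, and also reports how many cases contributed. The error metric, the function name, and the data values are assumptions for illustration; the metrics used in the experiments are described in the relevant sections.

```python
import math


def rmse_in_range(actual, predicted, lower, upper):
    """RMSE over test cases whose actual response lies in [lower, upper].
    Returns (rmse, count); rmse is None when no test case qualifies."""
    errors = [(p - a) ** 2 for a, p in zip(actual, predicted) if lower <= a <= upper]
    if not errors:
        return None, 0
    return math.sqrt(sum(errors) / len(errors)), len(errors)


# Hypothetical test data: actual responses and surrogate predictions.
actual = [-0.30, -0.04, 0.01, 0.03, 0.45]
predicted = [-0.10, -0.05, 0.02, 0.01, 0.20]

rmse_all, n_all = rmse_in_range(actual, predicted, -math.inf, math.inf)
rmse_roi, n_roi = rmse_in_range(actual, predicted, -0.05, 0.05)
print("all cases        :", n_all, "points, RMSE", round(rmse_all, 4))
print("range of interest:", n_roi, "points, RMSE", round(rmse_roi, 4))
```

Returning the count alongside the error makes it visible when only a handful of test points support the estimate, which is the situation that motivated a dedicated test set over cross validation.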

Steps  5a  &  5b:  If  the  surrogate  models  are  sufficiently  accurate  or  the  project  resources 
have  been  consumed,  terminate  the  process. 

Step  6:  Otherwise,  select  new  samples  for  analysis  using  contour-based  sampling  and  go  to 
Step 2.

Contour-based  sampling  was  selected  as  the  adaptive  sampling  method  of  choice  in  light  of 
the  experimental  results.  Before  applying  contour-based  sampling,  the  user  should  refer  to 
validation  test  results  for  all  data  sources  to  determine  what  the  response  range(s)  of  interest 
may  be,  accounting  for  any  observed  biases  or  uncertainty  in  the  results. 
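
The control flow of Steps 1 through 6 can be summarized as a short Python skeleton. It is only a structural sketch: every callable is supplied by the user, and the toy problem underneath (a quadratic response, a piecewise-linear "surrogate", and selection by bisecting the largest gap) is a hypothetical stand-in chosen so that the loop can be run. In particular, the toy selection rule is not contour-based sampling.

```python
def build_surrogate(initial_samples, analyze, train, evaluate, select_new,
                    target_error, max_analyses):
    """Steps 1-6 as a loop. analyze(x) -> response, train(data) -> model,
    evaluate(model) -> error, select_new(model, data) -> list of new inputs."""
    data = []
    pending = list(initial_samples)                  # Step 1
    while True:
        for x in pending:                            # Step 2
            data.append((x, analyze(x)))
        model = train(data)                          # Step 3
        error = evaluate(model)                      # Step 4
        if error <= target_error:                    # Step 5a
            return model, data, "accurate"
        if len(data) >= max_analyses:                # Step 5b
            return model, data, "budget exhausted"
        pending = select_new(model, data)            # Step 6


# --- Toy stand-ins so the skeleton can be executed ---
def analyze(x):
    return x * x


def train(data):
    pts = sorted(data)

    def model(x):
        # Piecewise-linear interpolation between neighboring samples.
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        raise ValueError("x outside sampled range")

    return model


test_points = [i / 20 for i in range(21)]


def evaluate(model):
    return max(abs(model(x) - analyze(x)) for x in test_points)


def select_new(model, data):
    # Stand-in rule: bisect the widest gap between existing samples.
    pts = sorted(x for x, _ in data)
    gap, left = max((b - a, a) for a, b in zip(pts, pts[1:]))
    return [left + gap / 2]


model, data, reason = build_surrogate([0.0, 1.0], analyze, train, evaluate,
                                      select_new, target_error=0.01, max_analyses=20)
print(reason, len(data))
```

Keeping the accuracy check (Step 5a) separate from the resource check (Step 5b) makes it clear which of the two conditions ended the process.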

The method defined by these steps was shown to perform well for problems which match the
conditions for which it was designed: design spaces which are large but contain only a small
feasible space constrained by the allowable values of one or more responses; responses which
can be approximated moderately well using a cheaper source of data; and responses with a
degree of uncertainty that is large relative to the scale of the response being modeled. When
one or more of those qualities is not present, the proposed method may not have an advantage
over the baseline approach of space-filling samples and deterministic single-fidelity modeling.

In  the  course  of  developing  and  demonstrating  this  method,  a  number  of  contributions  to  the 
field  of  advanced  design  methods  were  made.  These  contributions  will  be  reviewed  in  the 
next  section. 

9.3  Contributions 

The  contributions  that  resulted  from  this  research  effort  may  best  be  expressed  by  grouping  them 
by  topic.  On  the  topic  of  sample  selection  techniques: 

•  Contour-based  sampling  has  been  extended  to  address  multiple  responses  simultaneously. 

•  The  Probability  of  Interest  (POI)  requirement  was  introduced  to  allow  the  user  to 
determine  the  algorithm’s  behavior  and  balance  exploration  of  regions  that  may  be  of 
interest  against  exploitation  of  regions  that  are  known  to  be  of  interest. 

•  The  effects  of  varying  the  POI  requirement  were  demonstrated  for  a  representative 
aerospace  problem,  illustrating  the  changes  in  algorithmic  behavior  that  resulted. 

•  The computational savings produced by the use of the Schur complement rather than
direct  matrix  inversion  were  quantified  for  a  problem  with  dozens  of  variables. 

•  Stacked  Latin  hypercubes  were  proposed  and  demonstrated.  Tests  showed  them  to  have 
superior  space-filling  qualities  when  compared  against  other  progressive  sampling 
approaches  found  in  the  literature,  although  the  costs  of  creating  the  sample  design  were 
much  higher. 

•  It  was  shown  that  these  well-spread-out  sample  sets  could  be  difficult  to  fit  using  Kriging 
models;  when  the  design  space  is  large  or  very  few  samples  will  be  available,  it  may  be 
preferable  to  allow  moderate  clustering  so  that  the  Kriging  surrogate  can  better  identify 
correlation  in  the  data. 
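
As background for the Schur complement item in the list above, the following Python sketch shows one standard way in which a Schur complement avoids a fresh matrix inversion: when a symmetric matrix with a known inverse gains one row and one column (as a correlation matrix does when a single point is added), the inverse of the enlarged matrix can be assembled from the old inverse. Whether this matches the exact formulation used in this work is not stated in this section, so the code should be read as a generic illustration with hypothetical names and values.

```python
def mat_vec(m, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def extend_inverse(a_inv, b, c):
    """Given the inverse of a symmetric n x n matrix A, return the inverse of
    [[A, b], [b^T, c]] using the Schur complement s = c - b^T A^-1 b."""
    n = len(b)
    u = mat_vec(a_inv, b)                            # A^-1 b
    s = c - sum(bi * ui for bi, ui in zip(b, u))     # Schur complement (a scalar)
    if abs(s) < 1e-12:
        raise ValueError("extended matrix is singular or nearly so")
    new = [[a_inv[i][j] + u[i] * u[j] / s for j in range(n)] + [-u[i] / s]
           for i in range(n)]
    new.append([-u[j] / s for j in range(n)] + [1.0 / s])
    return new


# Known inverse of A = [[2, 1], [1, 2]].
a_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]
b, c = [0.5, 0.25], 1.0                              # new column and diagonal entry

full = [[2.0, 1.0, 0.5], [1.0, 2.0, 0.25], [0.5, 0.25, 1.0]]
full_inv = extend_inverse(a_inv, b, c)

# Check: multiplying the enlarged matrix by each column of its inverse
# should give the columns of the identity matrix.
product = [mat_vec(full, list(col)) for col in zip(*full_inv)]
for row in product:
    print([round(v, 10) for v in row])
```

The update needs only matrix-vector products with the existing inverse, so its cost grows with the square of the matrix size, whereas a direct inversion of the enlarged matrix grows with the cube.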
On the topic of the creation of surrogate models:

•  Data  harmonization  was  demonstrated  for  the  first  time  for  a  problem  outside  its 
field  of  origin,  geostatistics. 

•  A  variant  of  Ghoreyshi  cokriging,  in  which  all  low-fidelity  response  values  were 
incorporated  rather  than  only  the  one  which  corresponded  to  the  high-fidelity  response 
being  modeled,  was  implemented  and  evaluated.  It  was  not  found  to  offer  improved 
performance. 

•  Although  the  use  of  nuggets  to  capture  uncertainty  for  Kriging  models  was  shown  to 
improve  the  accuracy  of  those  models  when  fitting  noisy  responses,  it  was 
demonstrated  that  such  information  is  not  preserved  in  the  prediction  variance  of  the 
model. 

9.4  Future  Work 

The  work  presented  in  this  document  was  intended  to  be  thorough,  but  by  no  means 
exhaustive.  There  are  a  number  of  avenues  for  further  work  through  which  the  proposed 
method  might  be  improved  or  augmented: 

Most  importantly,  at  the  present  time  the  user  must  set  values  for  the  probability  of  interest 
(POI)  requirement,  the  number  of  candidate  points  to  evaluate,  and  the  number  of  test  points  to 
use  for  evaluation  when  using  contour-based  sampling.  The  values  selected  can  significantly 
affect the behavior of the algorithm, as shown in Section 6.9 and Appendix C, but at present it
is  difficult  to  determine  what  the  most  effective  values  should  be. 

In addition, it is expected that as the numbers of candidate and test points are increased, some
point  of  diminishing  returns  would  be  encountered  -  for  example,  beyond  some  density  of 
test  points,  adding  an  additional  test  point  might  not  increase  the  accuracy  of  the  algorithm. 
Although  the  effectiveness  of  the  additional  test  point  would  be  reduced  or  negated,  there 
would  still  be  an  incremental  cost  to  evaluate  how  that  test  point  is  affected  by  each 
candidate  point.  Similarly,  although  a  larger  number  of  candidates  would  be  expected  to 
increase  the  chance  that  a  highly  informative  candidate  might  be  identified,  a  similar  point  of 
diminishing  returns  without  diminishing  costs  might  be  expected. 

Also  of  interest  is  the  other  end  of  the  spectrum:  how  few  test  points  is  too  few?  At  what  point 
does  the  decreased  accuracy  when  evaluating  candidates  outweigh  the  computational  savings? 
How  small  can  the  pool  of  candidates  become  before  the  effectiveness  of  the  sampling  algorithm 
becomes  handicapped? 

It  is  possible  that  the  optimal  value  (or  schedule  of  values)  for  each  of  these  parameters  could  be 
highly  problem-dependent.  For  example,  in  Section  6.9  it  was  shown  that  POI  requirements 
much above 10 percent served to make the algorithm overly cautious, limiting its ability to
identify  the  region  of  interest.  This  behavior  stemmed  from  the  fact  that,  especially  early  on,  the 
surrogate  models  were  poor  representations  of  the  responses  being  emulated.  As  a  result,  cases 
which  appeared  promising  were  of  little  interest  and  vice  versa.  Without  low  POI  requirements, 
the  algorithm  could  not  do  the  exploration  necessary  to  build  an  accurate  understanding  of  the 
responses.  Once  the  models  had  improved,  a  higher  POI  requirement  might  have  led  to  sampling 
that  better  clustered  cases  near  the  region  of  interest.  If  this  behavior  could  be  anticipated,  an 
efficient schedule of POI requirements and candidate & test pool sizes could be developed,
whether  that  schedule  be  universal  or  tunable  to  various  types  of  problems. 

Lastly,  it  was  shown  that  sparse  sample  sets  led  to  difficulty  in  fitting  accurate  Kriging 
surrogates,  and  that  clustering  helped  to  address  that  difficulty.  However,  the  details  of  that 
clustering  have  yet  to  be  addressed  rigorously.  Presently  the  only  way  that  an  overly-sparse 
data  set  may  be  identified  is  when  the  DACE  Kriging  toolbox  fails  to  find  useful  correlation 
coefficients.  After  this  occurs,  any  attempts  to  address  the  problem  -  typically  by  clustering 
more  samples  near  existing  samples,  whether  manually  or  by  setting  a  high  POI  requirement  for 
the  sampling  algorithm  -  would  be  by  their  nature  reactive  rather  than  proactive.  An  investigation 
into  the  effects  of  sparsity,  as  well  as  ways  to  predict  and/or  mitigate  those  effects,  would  be 
greatly  beneficial  when  the  data  set  may  be  sparse. 

9.5  Final  Remarks 

If  more  accurate  information  could  be  made  available  to  decision-makers  early  in  the  design 
process,  trade  studies  and  optimizations  could  be  carried  out  with  greater  confidence.  By 
ensuring that decisions are made using a sufficiently accurate understanding of all consequences,
the  risk  that  later  analysis  will  reveal  a  previously-unsuspected  deficiency  in  the  design  may  be 
reduced  or  eliminated. 

The  process  of  selecting  experiments  to  acquire  the  most  useful  data  is  a  perennial  problem,  and 
often  a  thorny  one.  It  becomes  even  more  challenging  when  resources  are  in  short  supply  and 
no  analysis  can  be  wasted.  In  such  situations,  special  care  must  be  taken  to  ensure  that  each 
analysis  that  is  performed  is  as  informative  as  possible.  This  process  can  be  complicated 
when  several  responses  must  be  taken  into  account  at  once. 

This research effort attempted to contribute to those goals by identifying: techniques for
selecting the minimum set of samples while maximizing the useful information obtained;
techniques for leveraging cheaper sources of data to reduce dependence on
accurate-but-expensive analyses; and techniques for identifying, tracking and capturing any uncertainty
present  in  the  data  being  modeled.  The  resulting  approach  to  sampling  and  modeling  was 
shown  to  be  effective  for  the  most  part,  improving  predictive  accuracy  while  reducing  the 
number  of  expensive  analyses  required. 

The  proposed  method  is  not  a  silver  bullet  -  for  example,  the  time  &  effort  required  to  select 
each  sample  may  be  non-negligible.  This  method  is  most  effective  when  the  per-evaluation  cost 
of  the  primary  data  source  is  high  compared  to  the  costs  of  the  sample  selection  process. 
Although  the  problem  of  expensive  analyses  is  not  entirely  negated,  it  is  hoped  that  this  effort 
has  contributed  a  useful  step,  however  small,  toward  bringing  the  problem  down  to  manageable 
size.

Appendix A: A Practical Guide for Efficient Surrogate Modeling

A.1 Overview

The  goal  of  this  work  is  to  create  accurate  surrogate  models  while  minimizing  the  cost  of 
the  necessary  data.  The  surrogate  models  will  emulate  analysis  methods  such  as 
computational  tools  or  physical  measurements.  The  method  described  in  this  guide 
combines  three  techniques  -  adaptive  sampling,  data  fusion,  and  Kriging  nuggets  -  which 
serve  to  improve  predictive  accuracy  and/or  reduce  the  cost  of  acquiring  the  data  required 
to  train  useful  surrogate  models. 

This guide will help the reader apply the method. The guide assumes that the reader has
already set up the problem; that is, the reader has already identified the independent and
dependent parameters (the inputs & responses).[96] If possible, screening tests should
be performed to minimize the number of independent parameters. The user should also
have selected appropriate sources of data (e.g., CFD models or physical measurements)
based on the level of accuracy that is required: greater accuracy typically requires more
expensive sources of data. The guide will assume that the reader is familiar with advanced
design methods such as surrogate modeling and design space exploration; readers
unfamiliar with these topics may wish to review the survey articles of Shan & Weng[174,
189] and the text Response Surface Methodology by Myers, Montgomery, and
Anderson-Cook.[134]

This method may not be appropriate for problems that are highly resource-constrained. A
common rule of thumb for the amount of data necessary for an accurate surrogate model is at
least 10d applications of each data source, where d is the number of dimensions (i.e., free
parameters) being investigated simultaneously.[58, 106] If resource limits will not allow the
user to carry out the recommended 10d analyses using the desired data source, it is
unlikely that an accurate surrogate model can be trained. There are exceptions - if the
response behavior is very simple, if the ranges of the input parameters are very small, etc. -
but any problem which requires such expensive analyses is unlikely to be so
accommodating. If the problem is resource-constrained in this manner, a surrogate model
may be trained to match cheaper data sources, while the expensive data source is used
directly for verification and/or correction of the cheaper data in a manner similar to the
Pegasus booster design process.
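
As a worked example of the rule of thumb of ten analyses per free parameter, the short Python sketch below computes the recommended number of analyses for the two problem sizes discussed in Section 9.1.4 (9 and 49 free parameters) and checks a hypothetical budget against it. The function names and the budget value are assumptions for illustration.

```python
def recommended_analyses(num_parameters, factor=10):
    """Rule-of-thumb number of analyses per data source: factor * d."""
    return factor * num_parameters


def budget_is_sufficient(num_parameters, affordable_analyses):
    """True when the budget covers the rule-of-thumb number of analyses."""
    return affordable_analyses >= recommended_analyses(num_parameters)


for d in (9, 49):
    print(d, "free parameters ->", recommended_analyses(d), "analyses per data source")

# Hypothetical budget of 300 expensive analyses for the 49-parameter problem.
print("budget sufficient:", budget_is_sufficient(49, 300))
```

When the check fails, the options described in this section apply: train the surrogate on a cheaper data source, reduce the number of free parameters, or choose a cheaper data source.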

During  the  design  of  the  Pegasus  booster  in  the  early  ’90s,  project  resource  limitations 
meant  that  only  a  few  expensive  Navier-Stokes  simulations  could  be  performed.  The 
designers  used  cheaper  sources  of  data  to  design  the  vehicle  &  trajectory.  The  expensive 
simulations  were  used  to  investigate  phenomena  that  the  cheaper  simulations  wouldn’t 
capture,  such  as  interactions  between  shock  waves  and  boundary  layers,  to  determine 
whether  those  phenomena  would  significantly  affect  the  vehicle’s  performance.  The 
expensive  results  were  also  used  to  confirm  some  of  the  predictions  of  the  cheaper  data 
sources.  This  gave  the  designers  confidence  in  the  cheaper  simulations  while  minimizing 
the  use  of  expensive  simulations. [130]  If  such  an  approach  is  not  acceptable,  the  user  is 
urged  to  reduce  the  scope  of  the  problem  by  eliminating  free  parameters  (thus  reducing  the 
10d samples recommended) or using an alternative, cheaper data source.

The remainder of this guide will assume that the problem at hand has at least enough
resources to perform 10d analyses with each data source. This guide describes a method for
making the best use of those analyses. The steps of the method are given in detail beginning
in Section A.3. In some sections, example commands are given to clarify how to
accomplish certain tasks, such as creating a surrogate model that combines multiple sources
of data. These example commands are written for Matlab, as this was the environment used
for the original implementation.

The method being described was developed for predicting aerodynamics, so the description
will be in those terms. Note that, although it was developed for an aerodynamics problem,
the method is not restricted to that field and can be applied to any problem that has the
appropriate characteristics. Those characteristics are specified in the next section.

A.2  Intended  Applications 

First, the problem of interest must be complex enough that rapid analysis methods, such as
handbook methods or panel methods, are not sufficiently accurate. Instead, more complex
methods, such as computational fluid dynamics (CFD), are necessary to capture all the
relevant phenomena. Use of these complex methods means increased effort per analysis,
whether physical or computational. Due to the increased cost of data, it may not be
possible to construct adequate surrogate models using standard techniques. For simple
problems, where phenomena such as viscous or nonlinear effects are insignificant, there is
no incentive to use any special techniques to reduce the computational effort. Simple
problems can often be analyzed by quick-to-execute tools in less than a second. In such cases,
it is usually more effective to analyze a large number of samples than it is to spend time
calculating which single sample would be the most informative to obtain.

The term sample in this context refers to the combination of inputs to, and outputs from,
one analysis. If a computational tool is used, a sample is obtained when the data source
is used to determine the response(s) for a particular set of input values. The term analysis
refers to the process of determining the response values that correspond to a particular set
of input values. An analysis may be the application of a computer model or the
measurement of a physical system. Part of the goal of the technique is to identify the most
effective and informative samples to analyze.

Secondly, this technique is most effective when only a certain range of the response is of
interest. The technique seeks to improve predictive accuracy for samples likely to have a
response value within that range. If there are multiple responses, the technique seeks to
improve predictive accuracy for samples which fall within the specified ranges for all
responses.

As an example, the technique was developed to create accurate models of vehicle
performance at multiple flight conditions. It was found that the pitching moment coefficient
was a critical response when evaluating a possible design: if the pitching moment coefficient
at any flight condition was too large, the design would probably not be controllable. As a
result, this technique helped to identify designs that were likely to have small pitching
moment coefficients at all flight conditions. Those designs, once analyzed, could be used
to improve the surrogate models. In this case, the response values of interest were defined
as a range.

Alternatively,  this  technique  has  been  used  for  predicting  structural  failure.  The  user  would 
define  the  response  threshold(s)  of  interest,  e.g.,  the  limit  strength  of  a  beam,  and  the 
technique  would  identify  analyses  that  would  enhance  the  surrogate  model’s  accuracy 
when  predicting  whether  or  not  the  response  for  a  given  sample  would  exceed  the  specified 
limit.  By  performing  those  analyses  and  adding  them  to  the  data  set  used  to  train  the 
surrogates,  the  surrogate  models  became  better  predictors  of  system  reliability  than  when 
this  technique  was  not  applied.  In  that  case,  the  response  values  of  interest  were  defined 
using  one  threshold  value  for  each  response. 

Thirdly, this technique is intended for problems where multiple sources of data are available.
Typically  the  sources  of  data  will  have  different  levels  of  expense  associated  with  them: 
a  computational  model  might  take  a  few  minutes  or  hours  to  complete  one  analysis, 
while  a  flight  test  might  require  months  of  preparation  and  millions  of  dollars.  When  there 
is a large discrepancy in the expense associated with the available data sources, a
well-designed set of analyses can offer significant savings over a more haphazard approach. It
is  possible  to  incorporate  any  number  of  data  sources, [83]  but  for  the  sake  of  simplicity 
this  guide  will  assume  that  only  two  sources  of  data  are  available  and  that  one  data  source 
has  much  greater  per-analysis  costs  than  the  other. 

Most  likely,  the  cheaper  data  source  is  a  computational  model.  This  data  source  is 
unlikely  to  be  accurate  enough  for  use  as  the  main  source  of  data  (or  else  why  use  the 
more  expensive  source  at  all?).  Still,  the  cheaper  source  should  be  accurate  enough  to 
capture  trends  in  the  response  behavior.  For  an  aerodynamics  analysis,  the  main  source  of 
data  might  be  wind  tunnel  analyses  while  the  cheaper  source  of  data  might  be  a 
computational  tool  based  on  the  Euler  or  Navier-Stokes  equations. 

Lastly,  the  technique  is  intended  for  problems  where  significant  uncertainty  may  be  present 
in  one  or  more  responses.  This  uncertainty  can  stem  from  many  sources,  such  as  difficulty 
in  measuring  the  response,  random  noise,  or  errors  introduced  by  the  numerical  solver  of  a 
computational  tool. 

The  technique  can  be  applied  to  problems  which  do  not  have  all  four  of  the 
characteristics  described  in  this  section.  If  one  or  more  characteristics  are  not  present,  the 
technique can be adjusted to take advantage of that fact; likely adjustments will be called out
as needed in the following sections.

A.3 Setting Up the Problem

As  previously  stated,  this  guide  assumes  that  the  user  has  already  identified  appropriate  data 
sources,  input  variables  and  responses.  A  number  of  techniques  exist  to  help  the  user 
minimize  the  number  of  input  variables,  such  as  screening,  parameter  mapping,  and 
decomposition.  [174]  If  the  data  sources  have  not  been  validated  for  the  problem  at  hand, 
validation  tests  should  be  performed  to  quantify  the  accuracy  of  each  data  source  for  the 
relevant  applications:  in  vehicle  aerodynamics,  for  example,  validation  tests  should  be 
performed at each flight condition of interest, as aerodynamics analysis tools may be
accurate for some flight conditions yet perform poorly for others. In a sense, validation may
be considered Step Zero, as it must be completed before the technique can be applied.
Validation compares the response values predicted by the data source in question against the
“true” values, which come from some higher-fidelity data source such as physical
measurements. This validation data may be purpose-generated or compiled from the
available literature. [32, 59, 188, 190]

“Fidelity” in this context refers to “the degree to which a model is an accurate representation
of the real world for the intended uses of the model”. [8] For an aerospace vehicle, the
highest-fidelity data source would be a full-scale test of the vehicle at every relevant
flight condition. Naturally, generating that data can be quite expensive. Instead, simpler
data sources such as wind tunnels may be used, although it is important to make sure that
the simplification does not leave out any important effects such as those dependent on
Reynolds number. [17]

A.4  Selecting  Initial  Analyses 

Once the data sources have been validated, it is time to choose the analyses that will be
performed. This selection of the initial analyses constitutes Step One. The process of selecting
these analyses will depend heavily on the number of analyses that will be performed using
each data source.

This guide is intended for the situation where each data source can provide dozens or
hundreds of analyses. In such a situation, a common rule of thumb for the number of
initial analyses is 10d for each data source, where d is the number of dimensions (i.e., free
parameters) being investigated simultaneously. [58, 106]

Once the number of initial analyses has been selected, the user must determine which
analyses to run. The analyses should be chosen to maximize the knowledge gained. If
expert knowledge is available for the problem at hand, it may be easy to identify the most
useful analyses. On the other hand, if the response behavior is not known well, it may be
better to take a more universal approach: space-filling sampling. [170]

Space-filling sampling is a way of choosing analyses which attempts to spread those
analyses out as much as possible. This approach will roughly illustrate how the response
varies throughout the design space. Latin hypercubes are the most common form of
space-filling sampling, but other techniques such as Sobol sequences are becoming more popular.
Reviews of many available techniques can be found in the works of Chen et al. and Shan &
Weng. [27, 174, 189]
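
The rule of thumb and the Latin hypercube idea described above can be sketched in a few lines. This is a minimal illustration in Python rather than Matlab, using only the standard library; the function names, the two example parameters (Mach number and angle of attack) and their ranges are assumptions made for the example and do not come from the original implementation. Each parameter range is divided into n equal bins, and each bin is used exactly once per parameter.

```python
import random

def initial_sample_count(d, per_dimension=10):
    """Rule-of-thumb number of initial analyses: 10 per free parameter."""
    return per_dimension * d

def latin_hypercube(n, bounds, seed=None):
    """Return n samples; each dimension's range is cut into n equal bins
    and every bin is used exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One random point inside each of the n bins, in unit coordinates.
        unit = [(k + rng.random()) / n for k in range(n)]
        rng.shuffle(unit)  # decouple the bin order between dimensions
        columns.append([low + u * (high - low) for u in unit])
    # Transpose from per-dimension columns to per-sample rows.
    return [list(row) for row in zip(*columns)]

# Hypothetical two-parameter problem: Mach number and angle of attack (deg).
bounds = [(0.5, 2.0), (-5.0, 15.0)]
n = initial_sample_count(len(bounds))  # 10 * 2 = 20 analyses
samples = latin_hypercube(n, bounds, seed=1)
```

A plain random Latin hypercube such as this one guarantees even coverage of each parameter individually, not of the joint space; optimized or nested designs, as cited in this section, may spread the samples more evenly.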

If multiple data sources will be available and the user plans to apply a multi-fidelity
method (see Section A.5.3), it may be worthwhile to take this into account when selecting
the analyses to be performed. Researchers have proposed a variety of ways to design a set
of samples to make it highly compatible with multi-fidelity modeling. These approaches
to analysis design include methods such as nested or sliced Latin hypercubes. [95, 151, 152,
153]

Alternatively, it may not be plausible to expect dozens or hundreds of results from each
data source. This is common when physical measurements are made, as acquiring those
results is often much more expensive than a computational analysis. In this case, another
approach is necessary. Instead of spreading the expensive samples out, emphasis should be
placed on identifying the most useful samples to be analyzed. The process of identifying
useful analyses will depend on two sources of knowledge: validation results, and data
from cheaper analyses.

The  results  from  the  validation  tests  in  Step  Zero  are  useful  for  a  number  of  reasons. 

First,  the  user  can  identify  the  situations  where  the  cheaper  data  source  can  meet  the 
user’s  accuracy  requirements.  In  those  situations,  the  more  expensive  data  source  may 
not  be  necessary  at  all.  Conversely,  the  user  can  identify  situations  where  the  cheaper 
source  of  data  is  expected  to  have  particularly  poor  accuracy,  and  thus  where  the  cheaper 
source  should  not  influence  the  selection  of  expensive  analyses.  Expert  knowledge  -  in 
particular,  knowing  which  phenomena  affect  the  response  in  each  situation  and  comparing 
those  against  the  phenomena  that  each  data  source  can  capture  -  may  also  help  in  this 
regard. 

The  validation  results  can  also  help  the  user  to  identify  any  consistent  biases  that  may  be 
present  in  the  predictions  of  the  cheaper  data  source.  These  biases  are  then  used  to 
partially  “correct”  the  predictions  of  the  cheaper  data  source.  Once  this  is  done,  the  cheaper 
data  source  is  used  to  explore  the  design  space  and  identify  promising  regions.  For 
example,  if  the  goal  is  to  find  regions  of  the  design  space  where  the  “true”  response  is 
close  to  zero,  and  the  cheap  data  source  over-estimates  the  response  by  an  average  of  5 
units,  the  first  few  expensive  analyses  should  be  placed  in  regions  where  the  cheap  data 
source  predicts  the  response  is  close  to  5.  This  is  a  very  rough  approach  to  sample 
selection,  but  it  can  be  a  good  way  to  get  started. 
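
The bias-shifted selection described above can be illustrated with a short sketch. It is written in Python (standard library only) rather than Matlab, and the function name, candidate labels and numbers are hypothetical, chosen to mirror the example in the text: the cheap source over-estimates by 5 units on average and the response of interest lies near zero.

```python
def pick_expensive_candidates(candidates, cheap_predictions, mean_bias, target, k):
    """Rank candidates by how close the bias-corrected cheap prediction is to target.

    mean_bias is the average of (cheap prediction - true value) from validation,
    so a positive value means the cheap source over-estimates.
    """
    corrected = [p - mean_bias for p in cheap_predictions]
    order = sorted(range(len(candidates)), key=lambda i: abs(corrected[i] - target))
    return [candidates[i] for i in order[:k]]

# Hypothetical candidate designs and their cheap-source predictions.
candidates = ["A", "B", "C", "D", "E"]
cheap_predictions = [12.0, 5.5, -1.0, 4.0, 9.0]

# Choose the two designs whose corrected prediction is closest to zero,
# i.e. whose raw cheap prediction is closest to 5.
chosen = pick_expensive_candidates(candidates, cheap_predictions, 5.0, 0.0, 2)
```

Here the first expensive analyses would go to designs B and D, whose raw cheap predictions (5.5 and 4.0) are closest to 5.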

A.5 Training Surrogate Models

Step Two is to train one or more surrogate models. Surrogate modeling is a way to estimate
the behavior of a response using mathematical techniques. Once a surrogate model is
trained,  it  is  very  computationally  cheap  to  predict  the  value  of  the  response  for  some  new 
set  of  input  values.  This  is  particularly  important  for  design  space  exploration,  which  is  the 
process  of  investigating  the  response  value  for  different  combinations  of  input  values,  as 
exploration  can  require  a  large  number  of  analyses. 

A  number  of  different  surrogate  modeling  techniques  have  been  described  in  the  literature, 
including  Response  Surface  Methods,  Kriging,  Artificial  Neural  Networks,  Radial  Basis 
Functions,  and  Gaussian  Process  Models.  Reviews  of  many  available  techniques  and 
their  relative  merits  can  be  found  in  the  works  of  Chen  et  al.  and  Shan  &  Weng.[27,  174, 
189]  This  guide  will  assume  the  use  of  Kriging.  DACE,  a  useful  Kriging  toolbox  for 
Matlab,  can  be  downloaded  from  http://www2.imm.dtu.dk/~hbni/dace/. 

Kriging  is  used  in  this  guide  primarily  because  it  can  explicitly  account  for  uncertainty  in 
the  data.  When  fitting  a  Kriging  model,  a  covariance  matrix  is  used  to  represent  the 
relationships  between  the  training  samples.  If  uncertainty  is  present  in  the  data,  that 
uncertainty can be captured by adding “nuggets” to the diagonal of the covariance matrix.
The DACE toolbox presently does not allow user-defined nugget values, but that is easy to
remedy.

If there is no significant uncertainty present in the data, the reader can skip Sections
A.5.1 and A.5.2.

A.5.1  Implementing  Nuggets  in  DACE 

This  implementation  allows  the  user  to  specify  nugget  values  while  minimizing  changes 
to  the  existing  DACE  functions.  As  a  result,  the  command  sequence  to  make  a  Kriging  model 
with  nuggets  is  inelegant  but  functional,  as  will  be  demonstrated  in  Section  A.5.2. 

First, create a duplicate copy of dacefit.m named dacefitNugget.m. Change line 1 of
dacefitNugget.m from:

function [dmodel, perf] = dacefit(S, Y, regr, corr, theta0, lob, upb)

to:

function [dmodel, perf] = dacefitNugget(S, Y, regr, corr, theta0, lob, upb, nug)

This  tells  the  modified  function  to  expect  an  extra  input  parameter,  which  will  be  the 
nugget  value  or  values.  Next,  change  line  93  from: 

'D', D, 'ij', ij, 'scS', sS);

to:

'D', D, 'ij', ij, 'scS', sS, 'nuggets', nug);

This  incorporates  the  nugget  parameter  into  the  structure  par  which  contains  the  data 
necessary  to  evaluate  how  well  the  Kriging  model  fits  the  data.  Next,  modify  line  114  from: 

'C', fit.C, 'Ft', fit.Ft, 'G', fit.G);

to:

'C', fit.C, 'Ft', fit.Ft, 'G', fit.G, 'nuggets', nug);

This  adds  the  nugget  values  to  the  structure  dmodel,  which  is  the  trained  Kriging  model 
that  is  returned  by  the  function.  Adding  the  nugget  values  that  were  used  to  train  the  model 
is not strictly necessary, but may be helpful when documenting the Kriging model. Lastly,
change line 129 from:

[r(idx); ones(m,1)+mu]);

to:

[r(idx); ones(m,1)+mu+par.nuggets]);

This adds the nugget values to the diagonal elements of the covariance matrix. No other
functions  need  to  be  changed,  as  nuggets  only  have  a  direct  effect  on  the  surrogate  training 
process.  Predictions  made  with  the  resulting  Kriging  model  will  take  the  nuggets  into 
account  without  any  further  modifications. 

A.5.2  Using  the  Modified  DACEFIT  Function 

With the standard dacefit.m function, the user can fit a model using the command:

dmodel = dacefit(S, Y, TrendType, Correlation, theta, lob, upb);

Here, S is the matrix of input values; Y is the vector or matrix of response values;
TrendType specifies whether the underlying trend of the model is constant, linear or
quadratic (or in terms of the functions supplied with the DACE toolbox, @regpoly0,
@regpoly1 or @regpoly2); and Correlation specifies the desired correlation model (e.g.,
exponential, Gaussian, linear, etc., again referring to the functions included with DACE such
as @correxp). The remaining parameters represent the initial guess for correlation
coefficient values (theta) and the limits on those coefficients: lob is the lower bound while
upb is the upper bound.

The  first  step  of  creating  a  Kriging  model  using  nuggets  is  to  determine  the  value  of  the 
nugget.  The  nugget  can  be  a  scalar  or  a  vector.  If  the  uncertainty  is  constant  for  every 
training sample, the nugget will be a scalar equal to the variance of that uncertainty. If the
uncertainty is different for each training sample, the nugget will be a vector of size (n × 1),
where n is the number of training samples, and the i-th entry will be the variance of the
uncertainty in the response for the i-th training sample.

Note  that  before  it  can  be  used  to  fit  a  Kriging  model,  the  nugget  must  be  scaled  by  the 
process  variance  of  the  Kriging  model.  The  process  variance  is  calculated  by  DACE,  so  the 
set  of  commands  that  fit  a  Kriging  model  using  nuggets  is  almost  recursive: 

dmodel_noiseless = dacefit(S, Y, TrendType, Correlation, theta, lob, upb);

scaled_nugget = original_nugget / dmodel_noiseless.sigma2;

dmodel = dacefitNugget(S, Y, TrendType, Correlation, theta, lob, upb, scaled_nugget);

The  resulting  dmodel  is  a  Kriging  model  which  captures  the  uncertainty  specified  in  the 
nugget  value(s).  The  commands  to  use  dmodel  to  estimate  response  values  and  prediction 
confidence  are  identical  to  those  for  a  nugget-less  Kriging  model,  and  are  given  in  the 
manual  that  is  included  with  the  DACE  toolbox  download. 
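
The arithmetic behind the scaled nugget can be sketched independently of DACE. The following Python fragment (standard library only, not part of the original Matlab implementation) divides each noise variance by the process variance and adds the result to the diagonal of a correlation matrix. The function names, the 3 × 3 correlation matrix, the noise levels and the process variance are all hypothetical values chosen for illustration.

```python
def scale_nuggets(noise_variance, process_variance, n):
    """Return n scaled nugget values (noise variance / process variance).

    noise_variance may be a single number (same uncertainty for every
    training sample) or a list with one variance per training sample.
    """
    if isinstance(noise_variance, (int, float)):
        noise_variance = [noise_variance] * n
    if len(noise_variance) != n:
        raise ValueError("need one noise variance per training sample")
    return [v / process_variance for v in noise_variance]

def add_to_diagonal(corr, nuggets):
    """Return a copy of the correlation matrix with nuggets added to its diagonal."""
    size = len(corr)
    return [[corr[i][j] + (nuggets[i] if i == j else 0.0) for j in range(size)]
            for i in range(size)]

# Hypothetical three-sample case: measurement standard deviations of
# 0.1, 0.2 and 0.1 response units, and a process variance of 4.0.
corr = [[1.0, 0.6, 0.2],
        [0.6, 1.0, 0.6],
        [0.2, 0.6, 1.0]]
nuggets = scale_nuggets([0.1 ** 2, 0.2 ** 2, 0.1 ** 2], 4.0, 3)
corr_noisy = add_to_diagonal(corr, nuggets)
```

Only the diagonal changes; the off-diagonal correlations between different training samples are left as they were.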

A.5.3  Multi-Fidelity  Surrogate  Modeling 

If  the  reader  will  only  be  using  one  data  source,  this  section  may  be  safely  skipped.  If 
multiple  data  sources  will  be  available,  it  may  be  possible  to  combine  data  from  each 
source  into  a  single  surrogate  model.  This  new  surrogate  is  often  more  accurate  than  if 
only the cheaper data source were available, yet less expensive than if only the more
expensive data source were used. The methods used to create such surrogates are known as
multi-fidelity methods.

Multi-fidelity  methods  enable  the  user  to  train  surrogate  models  that  emulate  the  more 
expensive  data  source  while  reducing  the  number  of  expensive  analyses  necessary  to  achieve 
the  desired  accuracy.  Rather  than  having  to  infer  response  behavior  purely  from  the  results 
of  expensive  analyses,  multi-fidelity  methods  use  cheaper  analyses  to  learn  how  the 
response  varies  throughout  the  design  space.  Those  results  are  then  “corrected”  using  results 
from  the  more  accurate  (but  more  expensive)  data  source. 

For  example,  consider  the  problem  of  predicting  the  lift  coefficient  of  a  vehicle  at  multiple 
angles  of  attack.  Simple  linear  aerodynamics  tools  can  often  estimate  how  the  lift 
coefficient  changes  with  angle  with  good  accuracy  (at  least  for  small  angles)  but  may  be 
less  accurate  when  predicting  the  lift  coefficient  at  any  one  particular  angle.  More 
complex  methods  like  computational  fluid  dynamics  (CFD)  offer  better  predictions  of  lift 
coefficient  but  require  a  much  larger  computational  effort.  Multi-fidelity  methods  would 
use  the  cheaper  tools  to  estimate  how  the  response  (e.g.,  lift  coefficient)  changes  with 
angle,  and  correct  the  response  at  each  angle  based  on  one  or  two  results  from  the  more 
accurate  CFD  analysis. 

The  savings  offered  by  multi-fidelity  methods  will  depend  on  how  much  similarity  there 
is  between  the  behaviors  of  the  response  as  estimated  by  the  two  data  sources.  If  there  is 
very  little  similarity,  multi-fidelity  methods  will  not  offer  much  benefit  -  but  if  there  is 
so  little  similarity,  there  may  be  no  point  in  using  the  cheaper  source  of  data  at  all. 

A  number  of  multi-fidelity  techniques  have  been  developed  and  used  for  engineering 
purposes.  Some  create  two  surrogate  models,  one  to  emulate  the  cheaper  source  of  data 
($f_{\mathrm{cheap}}$)^1 and one to “correct” the cheaper result to match the more expensive source of data.
This type of multi-fidelity technique is very popular, but assumes that there will be
enough expensive results ($f_{\mathrm{expensive}}$) to create a surrogate model (preferably > 10d, where
d is the number of independent parameters). This family of methods includes:

Additive correction: one surrogate is trained to emulate $f_{\mathrm{cheap}}$, another is trained
to emulate $f_{\mathrm{expensive}} - f_{\mathrm{cheap}}$, and the predictions of both are added together to
estimate $f_{\mathrm{expensive}}$. [173]

Proportional correction: one surrogate is trained to emulate $f_{\mathrm{cheap}}$, another is
trained to emulate $f_{\mathrm{expensive}} / f_{\mathrm{cheap}}$, and the predictions from both surrogates are
multiplied together to produce an estimate for $f_{\mathrm{expensive}}$. [9]

Hybrid correction: a combination of additive and proportional correction. [94, 155]

Ghoreyshi cokriging: one surrogate is trained to emulate $f_{\mathrm{cheap}}$; the estimated
$f_{\mathrm{cheap}}$ value is then treated as an extra input parameter when fitting a surrogate to
$f_{\mathrm{expensive}}$.

^1 If the cheaper source of data is very cheap, it may be possible to use it directly instead of creating a
surrogate model.


Alternatively, some multi-fidelity techniques create only one surrogate model, using all
the available data at once. Because the effort of training a surrogate model increases as
more data is used, these methods may become quite computationally intensive for large
data sets. On the other hand, if the user will not have the resources to run > 10d analyses
with the expensive data source, these methods can still produce a surrogate that
incorporates all available data. These techniques include:

Cokriging:  a  form  of  Kriging  that  can  handle  multiple  responses  (or  the  same  response 
calculated by multiple data sources). [182] Some software tools are available that can
perform  cokriging,  but  most  can  only  handle  up  to  three  independent  parameters. [145,  164] 

Data  harmonization:  similar  to  additive  correction,  data  harmonization  attempts  to 
capture any biases between data sources. The source of each sample is identified using
binary columns. [13, 14, 15]
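
As an illustration of the two-surrogate family described earlier in this section, the sketch below builds additive and proportional corrections for a one-dimensional problem (lift coefficient against angle of attack). It is written in Python with the standard library only. A piecewise-linear interpolant stands in for the Kriging surrogates, which is an assumption made purely to keep the example short, and all names and data values are hypothetical.

```python
from bisect import bisect_right

def fit_interpolant(xs, ys):
    """Stand-in 'surrogate': piecewise-linear interpolation through 1-D data.
    (Assumption: used only to keep the sketch short; the guide uses Kriging.)"""
    pts = sorted(zip(xs, ys))
    px = [p[0] for p in pts]
    py = [p[1] for p in pts]

    def predict(x):
        # Clamp to the end segments so the model extrapolates linearly.
        k = min(max(bisect_right(px, x), 1), len(px) - 1)
        t = (x - px[k - 1]) / (px[k] - px[k - 1])
        return py[k - 1] + t * (py[k] - py[k - 1])

    return predict

def additive_model(cheap, x_exp, f_exp):
    """f_expensive(x) ~ cheap(x) + delta(x), with delta fitted to f_exp - cheap."""
    delta = fit_interpolant(x_exp, [f - cheap(x) for x, f in zip(x_exp, f_exp)])
    return lambda x: cheap(x) + delta(x)

def proportional_model(cheap, x_exp, f_exp):
    """f_expensive(x) ~ cheap(x) * ratio(x), with ratio fitted to f_exp / cheap.
    Breaks down wherever the cheap prediction is zero."""
    ratio = fit_interpolant(x_exp, [f / cheap(x) for x, f in zip(x_exp, f_exp)])
    return lambda x: cheap(x) * ratio(x)

# Hypothetical data: five cheap results and two expensive results.
cheap = fit_interpolant([0.0, 2.0, 4.0, 6.0, 8.0], [0.20, 0.40, 0.60, 0.80, 1.00])
x_exp, f_exp = [0.0, 8.0], [0.25, 1.05]

additive = additive_model(cheap, x_exp, f_exp)
proportional = proportional_model(cheap, x_exp, f_exp)
```

Both corrected models reproduce the two expensive results, but they differ in between: at an angle of 4 the additive model gives about 0.65 and the proportional model about 0.69. This is one reason a comparison test on the problem at hand is worthwhile.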

A.5.4  Demonstrating  Ghoreyshi  Cokriging 

Ghoreyshi  cokriging  was  selected  for  a  more  in-depth  demonstration,  as  it  produced  the 
most  accurate  surrogate  models  for  a  particular  test  problem.  Note  that  the  effectiveness  of 
each  multi-fidelity  method  is  strongly  problem-dependent,  and  although  Ghoreyshi 
cokriging  produced  the  most  accurate  surrogate  model  for  one  application,  another 
method  might  be  superior  for  a  different  application.  If  possible,  the  user  should  do 
comparison  tests  to  determine  which  method  is  best  for  the  problem  at  hand.  The  process 
of  evaluating  surrogate  model  accuracy  is  the  subject  of  the  next  section  of  this  guide. 

To  understand  the  implementation  of  Ghoreyshi  cokriging,  consider  a  problem  with  two 
input variables, $x_1$ and $x_2$. The matrix of inputs for a standard (non-multi-fidelity) Kriging
model with a linear underlying trend would resemble:

$$
\begin{bmatrix}
x_{1,1} & x_{2,1} \\
x_{1,2} & x_{2,2} \\
x_{1,3} & x_{2,3} \\
x_{1,4} & x_{2,4}
\end{bmatrix}
\tag{1}
$$

Here, $x_{1,1}$ represents the value of $x_1$ for the first training sample. The matrix of inputs for a
Ghoreyshi cokriging model with a linear underlying trend, on the other hand, would take
the form:

$$
\begin{bmatrix}
x_{1,1} & x_{2,1} & f_{\mathrm{cheap},1} \\
x_{1,2} & x_{2,2} & f_{\mathrm{cheap},2} \\
x_{1,3} & x_{2,3} & f_{\mathrm{cheap},3} \\
x_{1,4} & x_{2,4} & f_{\mathrm{cheap},4}
\end{bmatrix}
\tag{2}
$$

where $f_{\mathrm{cheap},i}$ represents the estimated response (as determined by the cheaper data source)
at the $i$-th sample. Thus, a problem with $d$ input dimensions becomes one with $d + 1$ input
dimensions.  The  process  of  fitting  the  Kriging  model  (or  any  other  form  of  surrogate  model) 
is  otherwise  unchanged.  When  using  the  DACE  toolbox  for  Matlab,  a  Ghoreyshi  cokriging 
model  may  be  created  using  the  following  commands: 

dmodel_cheap  =  dacefit(S_cheap,  f_cheap,  TrendType,  Correlation,  theta,  lob,  upb); 

Here, S_cheap is the matrix of input values that were analyzed with the cheaper data source,
while f_cheap is the vector or matrix of response values from the cheap data source.
TrendType specifies whether the underlying trend of the model is constant, linear or
quadratic. Correlation specifies the desired correlation model (e.g., exponential, Gaussian,
linear, etc.). The remaining parameters represent the initial guess for correlation coefficient
values (theta) and the limits on those coefficients: lob is the lower bound while upb is the
upper bound.

Next, a smaller set of analyses (S_expensive) is performed using the more expensive data
source, and the surrogate model of the cheap data source (dmodel_cheap) is used to
estimate the response values that would be obtained if the same smaller set of analyses were
performed using the cheaper data source. The estimated cheaper response values
(f_cheap_pred) are obtained with the command:

f_cheap_pred = predictor(S_expensive, dmodel_cheap);

Finally,  a  Ghoreyshi  cokriging  model  is  created  using  both  the  cheap  and  expensive  data 
by  including  the  cheaper  response  values  as  an  extra  input  parameter: 

dmodel_expensive = dacefit([S_expensive f_cheap_pred], f_expensive, TrendType,
Correlation, theta_extra, lob_extra, upb_extra);

Note  that,  because  there  is  one  more  input  parameter,  the  vectors  for  theta,  lob,  and  upb 
must  each  be  enlarged  by  one  entry.  theta_extra,  lob_extra  and  upb_extra  represent  those 
enlarged  vectors. 

To  predict  the  expensive  response  for  some  new  sample  using  this  Ghoreyshi  cokriging 
model,  the  user  must  also  predict  the  cheap  response  for  that  sample: 

f_cheap_newsample = predictor(S_new, dmodel_cheap);

f_expensive_newsample = predictor([S_new f_cheap_newsample], dmodel_expensive);

Here, f_expensive_newsample is the prediction of the Ghoreyshi cokriging model for the
response  value  for  the  new  sample. 
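
The same sequence of steps can be written without the DACE toolbox to show the data flow. The sketch below is Python (standard library only), and it substitutes inverse-distance-weighted interpolation for Kriging, an assumption made so that the example stays self-contained. The function names and the small data set are hypothetical. The key step is the one described above: the predicted cheap response is appended to each input row, so d inputs become d + 1.

```python
def fit_idw(S, y, power=2.0):
    """Stand-in surrogate (assumption): inverse-distance-weighted interpolation.
    Any surrogate with a fit/predict pair, such as Kriging, could take its place."""
    def predict(x):
        weighted, total = 0.0, 0.0
        for row, value in zip(S, y):
            dist2 = sum((a - b) ** 2 for a, b in zip(row, x))
            if dist2 == 0.0:
                return value  # exact hit on a training sample
            w = dist2 ** (-power / 2.0)
            weighted += w * value
            total += w
        return weighted / total
    return predict

def fit_ghoreyshi(S_cheap, f_cheap, S_expensive, f_expensive, fit=fit_idw):
    """Train the cheap surrogate, then an expensive surrogate with one extra input."""
    model_cheap = fit(S_cheap, f_cheap)
    # Append the predicted cheap response to every expensive input row: d -> d + 1.
    S_aug = [list(row) + [model_cheap(row)] for row in S_expensive]
    model_expensive = fit(S_aug, f_expensive)

    def predict(x):
        # A new sample needs its cheap prediction before the expensive one.
        return model_expensive(list(x) + [model_cheap(x)])

    return predict, S_aug

# Hypothetical data: four cheap analyses and two expensive analyses.
S_cheap = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_cheap = [0.0, 1.0, 2.0, 3.0]
S_expensive = [[0.0, 0.0], [1.0, 1.0]]
f_expensive = [0.5, 3.5]

predict, S_aug = fit_ghoreyshi(S_cheap, f_cheap, S_expensive, f_expensive)
f_new = predict([0.5, 0.5])
```

Because the stand-in surrogate interpolates its training data, the combined model returns the expensive results exactly at the two expensive samples; a Kriging model fitted with nuggets would not necessarily do so.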

At  this  point  in  the  process,  initial  samples  have  been  selected  and  analyzed  using  the 
various data sources. Using the results from those analyses, surrogate models have been
trained to estimate the response(s) of interest. The question may then be raised: are those
surrogate models accurate enough?

A.6  Evaluating  Surrogate  Model  Accuracy 

Step  Three  is  to  quantify  the  predictive  accuracy  of  the  surrogate  models  that  have  been 
trained.  Evaluation  of  surrogate  model  accuracy  is  typically  done  in  one  of  two  ways: 
cross  validation  or  test  samples. [88,  148] 

A.6.1  Test  Samples 

Evaluation  using  test  samples  is  simpler  conceptually,  but  requires  more  data  than  cross 
validation.  To  evaluate  surrogate  model  accuracy  using  test  samples,  some  number  of 
analysis  results  are  set  aside  and  not  used  to  train  the  surrogate  model.  The  more  test 
samples  available  for  this  purpose,  the  more  confidence  the  user  can  have  in  the  prediction 
error  estimate  for  the  surrogate.  If  the  user  is  only  concerned  with  the  predictive  accuracy 
for  samples  with  response  values  within  a  certain  range,  only  samples  with  responses 
within  that  range  should  be  used  for  testing;  obtaining  useful  test  samples  in  this  scenario 
may  require  extra  effort,  such  as  a  separate  optimization  process. 

Once  the  surrogate  has  been  trained,  it  is  used  to  predict  the  response  value  for  each  of  the 
test  samples.  These  predictions  are  then  compared  with  the  observed  results  to  calculate  the 
prediction error for the test set. For each sample in the test set, the prediction error ($e_i$) is
the discrepancy between the observation ($y_i$) and the prediction ($\hat{y}_i$):

$$ e_i = y_i - \hat{y}_i \tag{3} $$

Once  the  prediction  errors  for  all  n  samples  in  the  test  set  have  been  calculated,  they  can 
be used to evaluate the overall accuracy of the surrogate model. First, the average
prediction error ($\bar{e}_{ts}$) is calculated:

$$ \bar{e}_{ts} = \frac{1}{n} \sum_{i=1}^{n} e_i \tag{4} $$


This  value  indicates  whether  there  is  any  consistent  bias  in  the  predictions  of  the  surrogate, 
and  ideally  should  be  close  to  zero. 

Next, the spread of the prediction errors, $MSE_{ts}$, is calculated to determine how precise the
surrogate model is, i.e., how much variability is present in the prediction error. [10] A
model with some bias that is consistently within 5 percent of the correct value may be more
useful than a model that has no average bias but is occasionally off by 50 percent. To
assess this, the spread of the error is first calculated:

$$ MSE_{ts} = \frac{1}{n} \sum_{i=1}^{n} (e_i)^2 \tag{5} $$

Here, $MSE_{ts}$ has units of variance, i.e., the units of the response squared, which may be
difficult to interpret. The square root of $MSE_{ts}$ may be easier to work with:

$$ RMSE_{ts} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (e_i)^2} \tag{6} $$

This term, $\mathrm{RMSE}_{ts}$, is the Root Mean Squared Error and has the same units as the response
of interest. A small $\mathrm{RMSE}_{ts}$ value indicates that the surrogate model prediction errors did
not exhibit large variations.

By combining the average prediction error ($\bar{e}_{ts}$) and the deviation of the prediction error
($\mathrm{RMSE}_{ts}$), the user can estimate a confidence interval for a future prediction $\hat{y}(x)$. First,
the expected bias is accounted for:

$$\hat{y}(x)_{unbiased} = \hat{y}(x) + \bar{e}_{ts} = \hat{y}(x) + \frac{\sum_{i=1}^{n} e_i}{n} \tag{7}$$

Next,  assuming  the  prediction  error  is  normally  distributed,  a  95  percent  confidence 
interval  can  be  estimated: 


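$$\begin{aligned}
\hat{y}(x)_{lower95\%} &= \hat{y}(x)_{unbiased} - 2 \times \mathrm{RMSE}_{ts} \\
\hat{y}(x)_{upper95\%} &= \hat{y}(x)_{unbiased} + 2 \times \mathrm{RMSE}_{ts}
\end{aligned} \tag{8}$$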

These  values  let  the  user  estimate  how  much  uncertainty  is  present  in  the  predictions  of  the 
surrogate model. The user may assume that, in light of the available data, there is a 95
percent likelihood that the actual response $y(x)$ falls somewhere between $\hat{y}(x)_{lower95\%}$
and $\hat{y}(x)_{upper95\%}$.


The  argument  against  the  use  of  test  samples  for  accuracy  evaluation  is  that  some 
analytical  effort  -  the  effort  required  to  analyze  the  test  samples  -  does  not  contribute  to 
improving  the  surrogate  model.  If  the  per-analysis  cost  is  very  high,  this  may  be  considered 
too wasteful. Iooss proposed a way to distribute the test samples throughout the design
space  so  that  the  user  can  get  the  most  accurate  assessment  of  surrogate  model  accuracy  for 
the  least  possible  number  of  test  samples,  although  it  is  difficult  to  know  in  advance  how 
many  test  samples  will  be  required. [87] 

A.6.2  Cross  Validation 

The  alternative  to  using  separate  test  samples  is  cross  validation.  In  cross  validation,  all  the 
available  data  is  used  to  create  the  primary  surrogate  model.  This  primary  surrogate  is 
then  set  aside.  Then,  the  available  data  is  split  up  into  k  groups,  where  k  is  some  integer.  All 
but  one  group  of  data  are  used  to  train  a  new  surrogate  model,  and  that  surrogate  is  used  to 
predict the response values for the samples that were omitted. This is repeated until each
group has been omitted once. The resulting predictive
errors are used to infer the predictive accuracy of the original surrogate model. This process
is known as “k-fold cross-validation”.

As with the test-samples approach, if the user is trying to quantify the predictive accuracy of
the surrogate for samples where the response value lies within a specified range, the cross
validation procedure may be performed by progressively omitting only the samples which
have response values in that range. If most of the training samples do not have response
values in the range of interest, the estimate for prediction error may itself have low
confidence.

The best group size to use is a matter of some debate in the literature. The limiting case is
to leave out one sample each time. For a data set of N samples, N new models would
have to be created, each of which omits the $i$th training sample and then attempts to
predict the response at that omitted sample. The prediction error for the $i$th cross validation
model - which omitted the $i$th sample from its training data set - would be:

$$e_i = y_i - \hat{y}_i^{(-i)} \tag{9}$$

Here $y_i$ is the actual response value for the $i$th sample, and $\hat{y}_i^{(-i)}$ is the predicted response
using the $i$th cross validation model. The overall cross validation error, $\mathrm{MSE}_{cv}$, is calculated
over all $n$ omitted samples ($n = N$ in this limiting case) as:

$$\mathrm{MSE}_{cv} = \frac{1}{n} \sum_{i=1}^{n} (e_i)^2 \tag{10}$$

Like $\mathrm{MSE}_{ts}$, $\mathrm{MSE}_{cv}$ has units of variance, so another calculation is done to make the results
easier to interpret:

$$\mathrm{RMSE}_{cv} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (e_i)^2} \tag{11}$$

This term, $\mathrm{RMSE}_{cv}$, is the Root Mean Squared Error and has the same units as the
response of interest. A small $\mathrm{RMSE}_{cv}$ value indicates that the surrogate model prediction
errors were consistent and did not exhibit large variations.

As  with  the  test  samples  approach,  the  user  can  estimate  a  confidence  interval  for  a  future 
prediction $\hat{y}(x)$ using the average prediction error ($\bar{e}_{cv}$) and the deviation of the prediction
error ($\mathrm{RMSE}_{cv}$). First, the expected bias is accounted for:

$$\hat{y}(x)_{unbiased} = \hat{y}(x) + \bar{e}_{cv} = \hat{y}(x) + \frac{\sum_{i=1}^{n} e_i}{n} \tag{12}$$

Next, assuming the prediction error is normally distributed, a 95 percent confidence
interval can be estimated:

$$\begin{aligned}
\hat{y}(x)_{lower95\%} &= \hat{y}(x)_{unbiased} - 2 \times \mathrm{RMSE}_{cv} \\
\hat{y}(x)_{upper95\%} &= \hat{y}(x)_{unbiased} + 2 \times \mathrm{RMSE}_{cv}
\end{aligned} \tag{13}$$


As before, the user may assume that, in light of the available data, there is a 95 percent
likelihood that the actual response at $x$, $y(x)$, falls somewhere between $\hat{y}(x)_{lower95\%}$
and $\hat{y}(x)_{upper95\%}$.
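The leave-one-out procedure can be sketched as follows. This is a minimal illustration, not the report's implementation: a one-dimensional least-squares line (the hypothetical `fit_line`) stands in for the Kriging surrogate, and the data are invented.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; a stand-in for a real surrogate model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def leave_one_out(xs, ys, fit=fit_line):
    """Return (mean_error, rmse) from leave-one-out cross validation."""
    errors = []
    for i in range(len(xs)):
        # Train on every sample except the i-th, then predict the omitted one.
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append(ys[i] - model(xs[i]))
    n = len(errors)
    mean_error = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean_error, rmse

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
linear_mean_error, linear_rmse = leave_one_out(xs, [1.0 + 2.0 * x for x in xs])
curved_mean_error, curved_rmse = leave_one_out(xs, [x * x for x in xs])
```

The linear data are reproduced almost exactly by every reduced model, while the curved data give a large cross validation error because a line is a poor surrogate for them.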


This  approach  can  provide  a  reasonably  good  estimate  of  the  prediction  error  of  the 
surrogate  trained  with  all  the  samples,  but  there  are  three  points  of  concern.  First,  the  user 
must  train  N  extra  surrogate  models,  which  may  become  very  time-consuming  for  larger 
data sets.[58] Secondly, if certain training samples have a large effect on the surrogate
model,  there  may  be  a  large  spread  in  the  cross  validation  prediction  errors,  which  means 
there  may  be  a  lot  of  uncertainty  in  the  cross  validation  error  estimate.  Finally,  cross 
validation  only  evaluates  the  accuracy  of  surrogate  models  trained  with  some  of  the  data,  so 
the  true  predictive  accuracy  of  the  model  may  be  under-estimated. 

In  some  cases,  splitting  the  data  up  into  5  or  10  larger  groups  may  address  some  of  those 
concerns.  By  omitting  each  group  one  at  a  time  for  cross  validation  instead  of  each  sample 
individually,  the  number  of  extra  surrogate  models  that  must  be  trained  is  reduced  from  N 
to  5  or  10.  If  the  data  pool  is  large  enough  this  may  also  result  in  a  smaller  spread  of 
prediction  errors,  reducing  the  uncertainty  in  the  cross  validation  error  estimate. [100] 
However, if the amount of data is limited, the removal of 10-20 percent of the data for
k-fold cross validation may significantly affect the resulting surrogate models, leading to a
substantial over-prediction of surrogate error.[75] The user may screen for this effect
during cross validation by tracking the model parameters, such as correlation coefficients
(e.g., the theta parameters used by the DACE toolbox), for each new surrogate produced.
If these parameters vary significantly, the user may wish to increase k to split the data into
smaller groups.
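A k-fold variant with parameter tracking might look like the sketch below. It rests on illustrative assumptions: a straight-line fit replaces the Kriging surrogate, its slope plays the role that the theta parameters would play in DACE, and the groups are formed by simple interleaving.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of a least-squares line; stand-in surrogate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def k_fold(xs, ys, k):
    """Return (rmse, slopes): pooled CV error and the parameter of each fold's model."""
    n = len(xs)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    folds = [list(range(start, n, k)) for start in range(k)]  # interleaved groups
    errors, slopes = [], []
    for held_out in folds:
        omitted = set(held_out)
        train_x = [xs[i] for i in range(n) if i not in omitted]
        train_y = [ys[i] for i in range(n) if i not in omitted]
        intercept, slope = fit_line(train_x, train_y)
        slopes.append(slope)  # track how much the model changes between folds
        errors.extend(ys[i] - (intercept + slope * xs[i]) for i in held_out)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return rmse, slopes

# Hypothetical data: a linear trend with a small alternating disturbance.
xs = [float(i) for i in range(10)]
ys = [3.0 - 0.5 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
rmse, slopes = k_fold(xs, ys, k=5)
slope_spread = max(slopes) - min(slopes)
```

A large `slope_spread` relative to the parameter itself would be the signal, in this toy setting, that omitting a group changes the model too much.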

A.6.3  Emphasizing  Certain  Response  Value  Ranges 

If  the  user  is  not  attempting  to  create  a  surrogate  model  that  is  accurate  over  the  entire 
range  of  the  response,  but  rather  wishes  to  emphasize  a  certain  range  of  response  values, 
the  estimation  of  predictive  accuracy  becomes  slightly  more  complex.  This  was  noted 
briefly  in  the  previous  sections. 

In  order  to  quantify  the  predictive  error  of  the  surrogate  model  for  samples  with  response 
values  within  a  certain  range,  the  user  should  only  use  samples  that  fall  within  the  range  of 
interest  when  selecting  test  samples  or  cross  validation  groups.  By  using  only  those  samples, 
the  user  ensures  that  the  estimated  confidence  intervals  are  as  relevant  as  possible  to  the 
expected  use  of  the  surrogate.  Depending  on  the  behavior  of  the  response  and  the  size  of  the 
design  space,  however,  there  may  not  be  many  samples  available  which  fall  within  the 
desired  range. 

In  such  a  scenario,  the  user  is  faced  with  tough  options: 

•  Work With What’s Available. The user may simply go ahead and use the available
samples within the range of interest. If very few such samples are available, this
approach  may  result  in  error  estimations  that  have  high  uncertainty.  However,  if  the 
uncertainties  of  the  error  estimates  are  reported  along  with  the  predictions,  that  result 
may  be  good  enough. 

•  Use  “Close-Enough”  Data.  If  samples  are  available  that  have  response  values  close 
to  the  range  of  interest,  the  user  may  also  include  those  samples  when  performing 
the  predictive  accuracy  calculations.  If  the  extra  samples  are  close  to  the  range  of 
interest,  this  may  be  a  very  effective  option.  However,  there  is  no  guarantee  that 
good  predictive  accuracy  for  samples  close  to  the  range  of  interest  will  correspond 
to  good  accuracy  for  samples  within  the  range  of  interest:  picture  a  zoologist 
learning  about  lions  by  studying  housecats. 

•  Get More Data. The user may attempt to gather more results in hopes some of the
new  data  will  fall  within  the  range  of  interest.  This  may  be  expensive  or  infeasible, 
depending  on  the  time  and  effort  that  would  be  required. 

If  the  user  has  the  time  and  resources  to  obtain  more  data,  the  third  option  is  preferable 
as  it  offers  the  greatest  reduction  in  uncertainty.  If  separate  test  samples  are  used  for  accuracy 
evaluation,  an  optimization  approach  may  be  the  best  way  to  accumulate  samples  within 
the  range  of  interest.  On  the  other  hand,  if  cross  validation  is  used,  any  technique  that 
identifies  samples  within  the  range  of  interest  can  be  used,  as  those  samples  will  benefit 
both the accuracy estimation process and the accuracy of the surrogate model itself. One
such technique is described in Section A.7.

A.6.4  Stopping  Criteria 

There  are  multiple  reasons  for  the  user  to  decide  that  the  surrogate  model  is  “good  enough” 
and  end  the  process.  The  trivial  reason  is  that  all  resources  have  been  expended  -  there  is 
no  more  time  or  budget  left  to  run  more  analyses  or  train  new  surrogate  models.  This  result 
probably  would  not  be  satisfying  for  anyone  involved. 

The  literature  mostly  assumes  that  the  user  will  know  what  “good  enough”  means  for 
themselves  for  each  application.  Most  of  the  stopping  criteria  that  have  been  published  are 
intended  for  use  with  optimizations,  such  as  stopping  when  the  expected  improvement  is  less 
than  some  threshold  or  when  no  improvements  have  been  made  after  some  number  of 
cycles.  Fortunately,  some  guidelines  for  surrogate  model  accuracy  for  non-optimization 
purposes have been identified.[57]

As stated in the previous section, Root Mean Squared Error has the same units as the
response being modeled. The calculated RMSE value can be normalized by the useful
range of the response being modeled. This quantifies the prediction error relative to the
range of the response. If the user is interested in the global behavior of the response, this
range may be calculated using the largest and smallest observed response values. For
example, if the response has a maximum value of 20 and a minimum value of 7, the RMSE
would be normalized by a factor of 13. If, on the other hand, only a certain range of the
response is of interest (for example, the user is only interested in pitching moment
coefficients between -0.1 and 0.1), the RMSE would be normalized using this range of
interest (0.2, in this example). Note that when normalizing by a range of interest, the RMSE
should be calculated using only samples that fall within that range of interest.




This normalized RMSE value can give the user a rough estimate of the predictive accuracy
of the surrogate. As a guideline, a model with normalized RMSE of less than 10 percent
is considered to be “reasonable,” while one with normalized RMSE of less than 2 percent is
considered “very good.”[57] These guidelines can help the user determine whether the
current model is acceptable or if it will require additional investment, such as the
acquisition of more training data. Such acquisition is the subject of the next step in this
guide.
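The normalization and the two guideline thresholds can be expressed as a short sketch. The RMSE values below are invented for illustration, and the label returned for models above 10 percent ("needs more data") is a hypothetical placeholder rather than a term from the cited guideline.

```python
def normalized_rmse(rmse, range_low, range_high):
    """Divide the RMSE by the width of the response range being considered."""
    span = range_high - range_low
    if span <= 0:
        raise ValueError("range_high must exceed range_low")
    return rmse / span

def rate_surrogate(nrmse):
    """Apply the rough guideline: under 2% very good, under 10% reasonable."""
    if nrmse < 0.02:
        return "very good"
    if nrmse < 0.10:
        return "reasonable"
    return "needs more data"  # hypothetical label for anything above 10 percent

# Global behavior: observed responses span 7 to 20, so the range is 13.
global_nrmse = normalized_rmse(0.65, 7.0, 20.0)
# Range of interest: pitching moment coefficients between -0.1 and 0.1.
local_nrmse = normalized_rmse(0.002, -0.1, 0.1)
global_rating = rate_surrogate(global_nrmse)
local_rating = rate_surrogate(local_nrmse)
```

When a range of interest is used, the RMSE passed in should itself come only from samples inside that range.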

A.7 Selecting New Analyses Based On Previous Results

This step in the method, Step Four, assumes that the available surrogates are not accurate
enough, and that there are sufficient resources available to perform new analyses to acquire
more data. To be most effective, the analyses will be selected based on the results of previous
analyses, allowing them to emphasize portions of the design space that are of interest to the
user. This process of selecting the most useful new analyses based on previous results is
known as “adaptive sampling.”

Most of the research on adaptive sampling has been done in the
context of optimization, i.e., maximizing or minimizing some objective function.[38, 83,
111, 173] The present technique is intended for problems where a certain range of
response values is of interest, as opposed to maximizing or minimizing something, so a
less-common approach to sample selection is necessary. The preferred approach in this case is
contour-based sampling.[149]

Contour-based sampling evaluates potential new samples, known as “candidates,” by
estimating how each candidate would affect the predictive confidence of the surrogate model
if it were added to the data pool. Specifically, the contour-based sampling algorithm
attempts to identify the candidate which would produce the largest reduction in prediction
uncertainty for samples with responses within the specified ranges of interest.

To quantify the reduction in prediction uncertainty, test samples are used. Unlike in
model  validation,  the  test  samples  do  not  need  to  be  analyzed  in  contour-based  sampling. 
Instead,  the  surrogate  model  is  modified  as  if  the  candidate  had  been  added  to  the  training 
pool,  and  then  the  modified  surrogate  is  used  to  estimate  the  prediction  uncertainty  at  each 
test  sample.  These  uncertainties  are  then  combined  in  a  weighted  sum.  If  a  test  sample 
has  a  high  likelihood  of  having  a  response  value  within  the  specified  range  of  interest,  it 
will  be  weighted  heavily  (on  the  order  of  1);  if  the  test  sample  is  unlikely  to  fall  within  the 
range  of  interest,  it  will  be  weighted  lightly  (on  the  order  of  0).  The  algorithm  is  based  on 
Kriging,  since  that  surrogate  modeling  approach  allows  the  user  to  quickly  and  easily 
estimate  the  predictive  confidence  at  any  sample  in  the  design  space. 

For this guide, it will be assumed that there are multiple responses of interest, and the
user  wishes  to  accurately  identify  samples  for  which  the  value  of  every  response  falls  within 
some  user-defined  range  of  interest.  This  section  will  use  the  symbol  Q  to  refer  to  the  total 
number  of  responses  that  are  being  taken  into  account  when  selecting  new  samples  to 
analyze.

A.7.1 Generating Candidate & Test Samples

The  first  step  in  choosing  a  new  sample  with  contour-based  sampling  is  to  generate 
candidate  samples.  This  is  done  in  the  same  way  that  the  initial  samples  were  chosen  in 
Section  A.4;  a  space-filling  distribution  is  probably  the  simplest  and  most  universal 
approach.  The  number  of  candidates  generated  is  left  up  to  the  user.  Evaluating  more 
candidates  means  that  the  selection  algorithm  will  have  more  options  to  choose  from,  but 
will increase the amount of computational effort required to choose a new sample. A good
starting value may be 10d candidates, where d is the number of free parameters.

Next,  a  set  of  test  samples  is  generated.  These  test  samples  will  be  used  to  evaluate  the 
candidates  based  on  the  estimated  response  values  and  prediction  uncertainties.  Most 
likely,  the  test  samples  will  also  be  evenly  distributed  throughout  the  design  space.  The 
more  test  samples  that  are  used,  the  more  accurately  the  algorithm  will  assess  the 
candidates, but the longer the selection process will take. As a rough guide, 15d may be a
good  number  of  test  samples,  although  the  actual  best  value  will  depend  strongly  on  the 
problem  being  investigated. 
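One way to produce space-filling candidate and test sets is a Latin hypercube, sketched below. The choice of Latin hypercube sampling, the function name, the bounds, and the multipliers of d (10 for candidates, 15 for test samples) are all assumptions made for illustration; any space-filling design could be substituted.

```python
import random

def latin_hypercube(n_samples, bounds, seed=None):
    """Return n_samples points, one per equal-width stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # Place one point at a random position inside each of n_samples strata...
        column = [low + (high - low) * (i + rng.random()) / n_samples
                  for i in range(n_samples)]
        # ...then shuffle so strata are paired randomly across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [list(point) for point in zip(*columns)]

# Hypothetical design space with d = 3 free parameters.
bounds = [(0.0, 1.0), (-5.0, 5.0), (10.0, 20.0)]
d = len(bounds)
candidates = latin_hypercube(10 * d, bounds, seed=1)
test_samples = latin_hypercube(15 * d, bounds, seed=2)
```

Different seeds are used so that the candidate and test sets do not coincide.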

A.7.2  Filtering  Candidates 

If  desired,  the  user  may  speed  up  the  sample  selection  process  by  ignoring  candidates  that 
have  a  low  likelihood  of  having  responses  within  the  specified  ranges  of  interest.  To 
accomplish  this,  the  user  must  calculate  the  likelihood  that  each  candidate  falls  within  the 
range  of  interest  for  a  given  response.  This  is  done  by  using  the  surrogate  models  to 
estimate  the  response  value  (y)  and  the  prediction  uncertainty  (pmse)  for  each  candidate. 

The Kriging prediction uncertainty indicates, roughly, how close the prediction is
expected to be to the actual response value. The prediction error is assumed to follow a
Gaussian distribution. As a result, the probability that the response is larger than some target value
may be calculated analytically. Using that relationship, the likelihood that the response ($y$)
falls within certain limits ($y_{lower}$ and $y_{upper}$) can be calculated as:

$$P(y_{upper} > y > y_{lower}) = P(y > y_{lower}) - P(y > y_{upper}) \tag{14}$$

$P(y > y_{interest})$ is the likelihood that the predicted value $y$ is larger than some specified
threshold $y_{interest}$. Zelen & Severo[1] provide a method for calculating such likelihoods for a
standard normal distribution. To use this method, the prediction from the Kriging model
must be translated into a standard normal distribution.

The standard normal is a special case of the Gaussian, or normal, distribution for which $\mu$,
the mean, is equal to zero and $\sigma^2$, the variance, is equal to one. This is commonly written
as $N(0,1)$. Any normal distribution $N(\mu,\sigma^2)$ can be transformed into a standard normal
distribution using the following equation:

$$Z = \frac{y_{interest} - \mu}{\sigma} \tag{15}$$

Here, $y_{interest}$ is the threshold value, such as $y_{lower}$ or $y_{upper}$; $\mu$ is the predicted response value from
the surrogate model; and $\sigma$ is the prediction uncertainty.

To calculate the probability that $y_{interest}$ is larger than $\mu$, $Z$ is plugged into the equation:

$$\Phi(Z) \approx 1 - \phi(Z)\,(b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5), \qquad t = \frac{1}{1 + b_0 Z} \tag{16}$$

Here, $\phi(Z)$ is the probability density function (PDF) of the standard normal distribution,
which expresses the relative likelihood that a random draw from the distribution would produce a
value of $Z$. The $b$ coefficients each take a different value: $b_0 = 0.2316419$,
$b_1 = 0.319381530$, $b_2 = -0.356563782$, $b_3 = 1.781477937$, $b_4 = -1.821255978$, and
$b_5 = 1.330274429$. This approximation for $\Phi(Z)$ is accurate to within $7.5 \times 10^{-8}$ as long
as $y_{interest} > \mu$.


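The PDF of a normal distribution, $\phi(Z)$, can be calculated by:

$$\phi(Z) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Z-\mu)^2}{2\sigma^2}\right) \tag{17}$$

For the standard normal distribution, $\mu = 0$ and $\sigma^2 = 1$.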


Note that this relation for $\Phi(Z)$ is only valid when $y_{interest}$ is greater than $\mu$, the predicted
response. If $y_{interest}$ is less than $\mu$, this set of equations will give nonsense answers. In these
situations, the user should replace $y_{interest}$ with $-y_{interest}$ and $\mu$ with $-\mu$, which will
calculate the probability that $y_{interest}$ is less than $\mu$ ($P(y_{interest} < \mu)$). This must then be
converted back to $P(y_{interest} > \mu)$:

$$P(y_{interest} > \mu) = 1 - P(y_{interest} < \mu) \tag{18}$$

Using the predicted response value ($\hat{y}$) and prediction uncertainty (pmse) for a candidate
sample, the limits of the range of interest for the response ($y_{lower}$ and $y_{upper}$), and
Equations 14-18, the user can calculate the likelihood that the true response being emulated
by  the  surrogate  falls  within  the  range  of  interest.  If  there  are  multiple  responses,  this 
process  is  repeated  for  every  response,  and  the  minimum  value  is  retained  as  the  likelihood 
score,  referred  to  as  the  “probability  of  interest”  or  POI.  This  is  then  repeated  for  the  rest 
of  the  candidates.  The  user  may  then  choose  not  to  analyze  the  candidates  with  too  low  of  a 
likelihood  score,  saving  time  and  computational  effort. 
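The probability-of-interest calculation can be sketched as follows. The function names are hypothetical, and the reflection for negative Z is implemented here by symmetry of the normal distribution, which is one way to realize the sign-flip step described above. The coefficients are those listed for the polynomial approximation.

```python
import math

B0 = 0.2316419
B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)

def normal_cdf_approx(z):
    """Polynomial approximation of the standard normal CDF.

    The polynomial is valid for z >= 0; negative z is handled by symmetry.
    """
    if z < 0.0:
        return 1.0 - normal_cdf_approx(-z)
    t = 1.0 / (1.0 + B0 * z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    poly = sum(b * t ** (i + 1) for i, b in enumerate(B))
    return 1.0 - pdf * poly

def probability_in_range(mean, sigma, y_lower, y_upper):
    """Likelihood that a response ~ N(mean, sigma^2) lies between the limits."""
    sigma = max(sigma, 1e-11)  # guard against zero uncertainty
    below_upper = normal_cdf_approx((y_upper - mean) / sigma)
    below_lower = normal_cdf_approx((y_lower - mean) / sigma)
    return below_upper - below_lower

def probability_of_interest(predictions, ranges):
    """Minimum in-range likelihood over all responses for one candidate."""
    return min(probability_in_range(mean, sigma, low, high)
               for (mean, sigma), (low, high) in zip(predictions, ranges))

# Hypothetical candidate with two responses: (prediction, uncertainty) pairs.
predictions = [(0.0, 1.0), (5.0, 1.0)]
ranges = [(-1.96, 1.96), (0.0, 1.0)]
poi = probability_of_interest(predictions, ranges)
```

In this example the first response is very likely to be in range, but the second is not, so the candidate's POI is driven by the second response.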

This  raises  the  question:  what  constitutes  “too  low”  of  a  POI  score?  There  is  a  natural 
impulse  to  demand  high  values,  such  as  75  percent  or  even  90  percent.  This  seems 
reasonable, but may be too restrictive: if the prediction uncertainty is large relative to the
response range of interest, it is possible that no candidate will meet the requirement.
Instead, the user should review the likelihood scores of the candidates and choose a
requirement that makes sense in light of those scores.

Higher  POI  requirement  values  will  eliminate  more  candidates,  reducing  the  necessary 
computational  effort  to  select  a  new  sample.  This  may  have  the  side-effect  of  eliminating 
the  candidates  with  higher  uncertainties,  i.e.,  those  farther  away  from  existing  samples, 
which  could  result  in  a  very  conservative  approach  that  does  not  explore  the  design  space 
much.  Alternatively,  lower  requirement  values  allow  more  exploration  of  the  design  space, 
but  increase  the  time  and  effort  required  to  select  each  sample,  resulting  in  a  slower  sample 
selection  process.  It  is  recommended  that  a  POI  requirement  of  at  least  0.01-0.1  percent  be 
used  to  screen  out  any  candidates  that  are  extremely  unlikely  to  fall  within  the  range  of 
interest. 

A.7.3  Analyzing  a  Candidate 

The  next  step  is  to  evaluate  how  much  the  predictive  uncertainty  of  the  surrogate  model 
would  be  reduced  if  a  particular  candidate  were  analyzed.  Ordinarily,  this  would  require 
estimating  the  responses  for  the  candidate  sample,  adding  the  sample  to  the  training  data 
set,  and  re-training  the  surrogate  model  to  include  the  new  sample.  This  can  quickly 
become  computationally  expensive  if  more  than  a  handful  of  candidate  samples  must  be 
evaluated.  For  Kriging,  the  computational  expense  is  primarily  due  to  the  need  to  invert  a 
correlation  matrix  each  time  a  surrogate  is  trained.  However,  most  of  the  training  data  set 
will  be  unchanged;  only  the  candidate  sample  will  vary.  This  simplifies  the  problem 
significantly,  as  will  be  demonstrated  shortly. 

The calculations and example code given in this section are for a single-fidelity model, not
the Ghoreyshi cokriging model that was described in Section A.5.3. This is because the
choice of multi-fidelity approach will depend on the problem being addressed; any
multi-fidelity approach could be substituted into this sample selection algorithm.

Some  of  the  terms  in  the  calculations  depend  only  on  the  current  surrogate  model,  not  on  any 
of  the  candidates.  These  terms  can  be  calculated  before  any  candidates  are  evaluated: 

Cinv = cell(1,Q);
F = cell(1,Q);

for q = 1:Q
    daceCholesky = full(dmodel{q}.C);        % Cholesky factor of R
    daceCinv = inv(daceCholesky);
    Cinv{q} = daceCinv' * daceCinv;          % inverse of the correlation matrix R
    F{q} = full(dmodel{q}.C) * dmodel{q}.Ft; % From DACE manual, equation 3.10
end

IMSE = zeros(number_of_candidates, Q);

Note that there is a change in notation that occurs in those lines of code: the DACE
model uses dmodel{q}.C to refer to the Cholesky factorization of the correlation matrix, R,
for the q-th surrogate in dmodel. However, Cinv{q} is the inverse of the correlation matrix, not
the inverse of the Cholesky factorization. This change was done to align with the notation
used by Picheny et al.[149]

IMSE  is  a  matrix  that  will  contain  the  weighted  uncertainty  values  that  are  calculated  for 
each  candidate.  Q  is  the  number  of  responses  that  are  being  modeled;  the  meaning  of 
number_of_candidates should be clear.

All operations after this point must be repeated for each candidate sample - or at least,
each candidate sample that meets or exceeds the probability of interest requirement
described  in  Section  A.7.2,  if  such  requirements  were  set.  The  variable  j  will  be  used  as 
the  index  denoting  a  particular  candidate  sample  from  the  set  of  options. 

Because  DACE  normalizes  the  data  when  creating  a  Kriging  model,  the  candidate 
sample  must  be  normalized  in  the  same  manner: 

[n,o] = size(S);
mS = mean(S);
sS = std(S);
Snorm = (S - repmat(mS, n, 1)) ./ repmat(sS, n, 1);
candidate_norm = (candidateSet(j,:) - mS) ./ sS;   % scale by the standard deviation, as for S
Snorm_can = [candidate_norm; Snorm];

Here,  n  is  the  number  of  training  samples  that  are  available,  and  o  is  the  number  of  free 
parameters  included  in  the  model. 
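The same normalization can be illustrated outside MATLAB. The sketch below is an assumption-laden illustration with invented data: it subtracts the column means of the training samples and divides by their sample standard deviations (the N-1 form, which is what MATLAB's `std` returns by default), then applies the same shift and scale to a candidate.

```python
import statistics

def column_stats(samples):
    """Column-wise mean and sample standard deviation of the training inputs."""
    columns = list(zip(*samples))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # N-1 denominator
    return means, stds

def normalize(point, means, stds):
    """Shift and scale one point using the training-set statistics."""
    return [(x - m) / s for x, m, s in zip(point, means, stds)]

# Hypothetical training inputs: n = 3 samples, 2 free parameters.
training = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
means, stds = column_stats(training)
training_norm = [normalize(row, means, stds) for row in training]
candidate_norm = normalize([3.0, 25.0], means, stds)
```

The candidate must be scaled with the training statistics, not with statistics computed from the candidates themselves, so that distances to the training data are measured in the same units.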

Next, the normalized distances from the candidate sample to the existing training data are
calculated, and those results are used to determine C_new, the correlations between the
candidate and the training data:

newS = Snorm;
D_temp = repmat(candidate_norm, n, 1) - newS(1:n, :);
C_new = feval(correlationModel, dmodel{q}.theta, D_temp);

Here, correlationModel indicates the correlation model that was used to build the Kriging
surrogate, such as Gaussian (“@corrgauss”), exponential (“@correxp”), etc. The
correlation model names correspond to the correlation models included with the DACE
toolbox.

To evaluate how the candidate would affect the prediction confidence, the augmented
correlation matrix (which includes the candidate sample) must be calculated and inverted.
Rather than using brute force, Schur’s complement[201] calculates the new inverted
correlation matrix ($C_{n+1}^{-1}$) in terms of the inverted correlation matrix without the candidate
sample ($C_n^{-1}$):

$$C_{n+1}^{-1} = \begin{bmatrix} 1 & 0 \\ -C_n^{-1} c_{new} & I_n \end{bmatrix} \begin{bmatrix} \dfrac{1}{\sigma^2 - c_{new}^T C_n^{-1} c_{new}} & 0 \\ 0 & C_n^{-1} \end{bmatrix} \begin{bmatrix} 1 & -c_{new}^T C_n^{-1} \\ 0 & I_n \end{bmatrix} \tag{19}$$


Here, $I_n$ is the $n \times n$ identity matrix, and $\sigma^2$ is the estimated process variance of the surrogate
model. 0 and 1 represent blocks of all zeros and all ones, respectively. $c_{new}$ is an $n \times 1$ vector
containing the correlation between the candidate sample and each existing training data
sample.
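The block-inverse update can be checked numerically on a small example. The sketch below is illustrative only: the helper names and the 2 x 2 matrix are invented, the candidate is placed in the first row and column (matching the ordering used for the normalized sample set), and the three-matrix product is multiplied out into its four blocks, which assumes the existing inverse is symmetric.

```python
def mat_inv(m):
    """Gauss-Jordan inverse with partial pivoting (adequate for small demos)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def augmented_inverse(c_inv, c_new, variance=1.0):
    """Inverse of [[variance, c_new^T], [c_new, C_n]] built from C_n^{-1}."""
    n = len(c_inv)
    # u = C_n^{-1} c_new, reused in every block of the result.
    u = [sum(c_inv[i][k] * c_new[k] for k in range(n)) for i in range(n)]
    schur = variance - sum(c_new[i] * u[i] for i in range(n))
    rows = [[1.0 / schur] + [-u_j / schur for u_j in u]]
    for i in range(n):
        rows.append([-u[i] / schur] +
                    [c_inv[i][j] + u[i] * u[j] / schur for j in range(n)])
    return rows

# Hypothetical correlations: two training samples plus one candidate.
c_n = [[1.0, 0.5], [0.5, 1.0]]
c_new = [0.3, 0.2]
updated = augmented_inverse(mat_inv(c_n), c_new)
full = [[1.0, 0.3, 0.2], [0.3, 1.0, 0.5], [0.2, 0.5, 1.0]]
direct = mat_inv(full)
```

Only matrix-vector products with the existing inverse are needed per candidate, which is the source of the savings over inverting the full augmented matrix each time.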

The augmented F matrix must be calculated as well:

f_candidate = feval(dmodel{q}.regr, candidate_norm);
F_nplusone = [f_candidate; F{q}];
to_be_inverted_term = (F_nplusone' * C_nplusone_inv * F_nplusone);


The dmodel{q}.regr term indicates the underlying trend model that was used for the
Kriging surrogate dmodel{q}. These trend models, such as “@regpoly1,” are included with
DACE.

The  rest  of  the  calculations  depend  on  the  test  samples  as  well  as  the  candidate  sample. 
The variable k will be used as an index to denote individual test samples. First, each test
sample must be normalized, just like the training data and the candidates:

test_norm = (testSet(k,:) - mS) ./ sS;   % scale by the standard deviation, as for S

Then,  using  the  normalized  test  sample,  the  distances  to  the  training  data  (including  the 
current  candidate)  and  the  correlation  between  the  test  sample  and  the  training  data  are 
calculated: 

D_test = repmat(test_norm, n+1, 1) - Snorm_can(1:n+1, :);
c_test = feval(correlationModel, dmodel{q}.theta, D_test);

Next, the f value for the test sample is calculated, which is the last term required to
calculate the prediction uncertainty for that sample:

f_test = feval(dmodel{q}.regr, test_norm);
Res_MSEs(k,q) = dmodel{q}.sigma2 * (1 - c_test' * C_nplusone_inv * ...
    c_test + (f_test - c_test' * C_nplusone_inv * F_nplusone) * ...
    (to_be_inverted_term \ (f_test - c_test' * C_nplusone_inv * ...
    F_nplusone)'));

Once  the  new  prediction  uncertainty  has  been  estimated,  the  weighting  value  for  that  test 
sample  can  be  evaluated.  Like  the  probability  of  interest  calculations,  the  weighting 
function  depends  on  the  cumulative  distribution  function  of  the  normal  distribution 
(Equation 16), so first some logic is introduced to make sure that the predicted response
value for the test sample ($\mu$) is less than the threshold of interest ($y_{lower}$ or $y_{upper}$):

predictedValue1 = predictor(testSet(k,:), dmodel{q});
predictedValue2 = predictedValue1;

rev1 = 0;
if predictedValue1 > y_upper
    % Reflect the prediction about y_upper and remember to flip the result
    predictedValue1 = predictedValue1 - 2 * (predictedValue1 - y_upper);
    rev1 = 1;
end

rev2 = 0;
if predictedValue2 > y_lower
    % Reflect the prediction about y_lower and remember to flip the result
    predictedValue2 = predictedValue2 - 2 * (predictedValue2 - y_lower);
    rev2 = 1;
end

After making sure that the requirements for Equation 16 have been satisfied, $P(y > y_{lower})$
and $P(y > y_{upper})$ can be calculated:

Znorm1 = (y_upper - predictedValue1) / sqrt(Res_MSEs(k,q) + 0.00000000001);
t1 = 1 / (1 + b0 * Znorm1);
standard_normal1 = (1 / sqrt(2*pi*1)) * exp(-((Znorm1 - 0)^2 / (2*1)));
prob1 = 1 - standard_normal1 * (b1*t1 + b2*t1^2 + b3*t1^3 + b4*t1^4 + b5*t1^5);
if rev1 == 1
    prob1 = 1 - prob1;
end

Znorm2 = (y_lower - predictedValue2) / sqrt(Res_MSEs(k,q) + 0.00000000001);
% t2 and standard_normal2 mirror the calculation for the first threshold
t2 = 1 / (1 + b0 * Znorm2);
standard_normal2 = (1 / sqrt(2*pi*1)) * exp(-((Znorm2 - 0)^2 / (2*1)));
prob2 = 1 - standard_normal2 * (b1*t2 + b2*t2^2 + b3*t2^3 + b4*t2^4 + b5*t2^5);
if rev2 == 1
    prob2 = 1 - prob2;
end

Note that some division operations include an extra term of 10^-11. These additions were
made to account for situations where the uncertainty at a test sample is close to zero, which
may be the case if the test sample is very close to a sample in the training set. By adding
this term of 10^-11 under the square root, the division operation avoids any divide-by-zero errors.
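The reflection logic and the polynomial in t can be checked in isolation. The sketch below assumes that b0 through b5 are the widely published Abramowitz and Stegun (formula 26.2.17) coefficients, which are not listed in this part of the guide; the threshold, prediction, and variance are hypothetical values. It compares the approximation against a reference computed with erf:

```matlab
% ASSUMED coefficients (Abramowitz & Stegun 26.2.17); not taken from the guide.
b0 = 0.2316419;
b  = [0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429];

y_upper = 0.1;      % hypothetical threshold
y_pred  = 0.25;     % hypothetical prediction, above the threshold
mse     = 0.04;     % hypothetical prediction variance

% Reflect the prediction about the threshold so that z >= 0,
% which is the range where the polynomial approximation is valid.
y_hat = y_pred;
rev = y_hat > y_upper;
if rev
    y_hat = y_hat - 2 * (y_hat - y_upper);
end

z   = (y_upper - y_hat) / sqrt(mse + 1e-11);
t   = 1 / (1 + b0 * z);
pdf = exp(-z^2 / 2) / sqrt(2*pi);
p   = 1 - pdf * sum(b .* t.^(1:5));
if rev
    p = 1 - p;      % undo the reflection
end

% Reference value from the error function, using the unreflected prediction.
z_true = (y_upper - y_pred) / sqrt(mse + 1e-11);
p_ref  = 0.5 * (1 + erf(z_true / sqrt(2)));
fprintf('approximation = %.6f, erf-based = %.6f\n', p, p_ref);
```

If the two printed values agree closely, the reflection flag and the coefficient set are consistent with each other; a large difference would suggest a sign or indexing slip.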

The weighting function is equal to P(y_upper > y > y_lower), which can be transformed using
Equation 18 into:

W(k,q) = (prob1 - prob2);

Lastly, the contribution of this test sample to the overall weighted uncertainty for the
current (j-th) candidate is calculated as a weighted sum:

IMSE(j,q) = IMSE(j,q) + W(k,q) * Res_MSEs(k,q);
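To make the effect of the weighting concrete, the following sketch accumulates the weighted uncertainty for a single candidate and a single response in vectorized form. The predictions, variances, and range of interest are hypothetical, and erf is used for the normal CDF rather than the polynomial approximation, so this is an illustration of the idea rather than the guide's own code:

```matlab
% Hypothetical predictions and post-update variances at four test points.
y_hat   = [0.00; 0.05; 0.40; -0.60];
mse     = [0.010; 0.020; 0.010; 0.020];
y_lower = -0.1;
y_upper =  0.1;                              % hypothetical range of interest

Phi = @(z) 0.5 * (1 + erf(z / sqrt(2)));     % standard normal CDF
s   = sqrt(mse + 1e-11);                     % same guard term as in the guide

% Weight = probability that the response lies inside the range of interest.
W = Phi((y_upper - y_hat) ./ s) - Phi((y_lower - y_hat) ./ s);

% Weighted uncertainty for this candidate and response.
imse_j = sum(W .* mse);
disp([W, W .* mse]);
```

Test points whose predictions sit far outside the range receive weights near zero, so their remaining uncertainty contributes almost nothing to the candidate's score.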

This sequence is repeated for each test sample. Then, the (j+1)-th candidate is evaluated in
the same manner. This continues until all candidates have been evaluated. If necessary, the
algorithm then moves on to the next response (from q to q + 1) and the analysis begins again.

If  the  sample-selection  algorithm  is  evaluating  only  one  response,  the  best  sample  can  be 
selected  at  this  point.  The  sample  that  will  most  reduce  predictive  uncertainty  in  the 
response  range  of  interest  is  the  one  with  the  lowest  IMSE  value. 

If  more  than  one  response  has  a  specified  range  of  interest,  it  is  likely  that  the  IMSE 
values  for  each  response  will  have  very  different  magnitudes.  Unless  this  is  addressed,  the 
response  with  the  largest  IMSE  values  will  dominate  the  sample  selection  process.  To 
account  for  this  factor,  the  IMSE  values  for  each  response  are  normalized  to  have  a  mean 
value of 0 and a standard deviation of 1:

avgIMSE = ones(Q,1);
stdIMSE = ones(Q,1);
norm_IMSE = zeros(size(IMSE));
for q = 1:Q
    avgIMSE(q) = mean(IMSE(:,q));
    stdIMSE(q) = std(IMSE(:,q));
    norm_IMSE(:,q) = (IMSE(:,q) - avgIMSE(q)) / stdIMSE(q);
end

The normalized IMSE values for each candidate are averaged to quantify the overall effect
that is expected if that candidate were to be analyzed. The results are then sorted and the
most effective sample is identified:

net_IMSE = zeros(number_of_candidates, 1);
for j = 1:number_of_candidates
    net_IMSE(j) = mean(norm_IMSE(j,:));
end

results = sortrows([net_IMSE candidateSet]);
new_sample = results(1, 2:number_of_dimensions + 1);

The  sample  that  is  selected,  new_sample,  is  the  candidate  that  is  expected  to  be  the  most 
useful  new  sample  to  analyze. 
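The normalization and selection steps can be tried on a small hypothetical data set. In the sketch below the IMSE values and candidate inputs are made up, and min is used in place of sortrows, which returns the same first-ranked candidate when there are no ties:

```matlab
% Hypothetical IMSE values: 4 candidates (rows) x 2 responses (columns).
% The two responses deliberately differ by two orders of magnitude.
IMSE = [4.0  0.020;
        2.0  0.050;
        3.0  0.010;
        5.0  0.030];
candidateSet = [0.1 0.2; 0.3 0.4; 0.5 0.6; 0.7 0.8];   % hypothetical inputs

% Standardize each response column (zero mean, unit standard deviation).
norm_IMSE = zeros(size(IMSE));
for q = 1:size(IMSE, 2)
    norm_IMSE(:,q) = (IMSE(:,q) - mean(IMSE(:,q))) / std(IMSE(:,q));
end

% Average across responses and pick the candidate with the lowest score.
net_IMSE = mean(norm_IMSE, 2);
[~, best] = min(net_IMSE);
new_sample = candidateSet(best, :);

% For comparison: the choice if the raw values were averaged instead.
[~, best_raw] = min(mean(IMSE, 2));
fprintf('normalized choice: %d, raw choice: %d\n', best, best_raw);
```

With these numbers the raw average is dominated by the first response and picks a different candidate than the normalized average. Note that a response whose IMSE is identical for every candidate would have a standard deviation of zero, so a guard on that division may be worth adding.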

This  algorithm  is  not  infallible.  It  can  only  select  one  of  the  candidates  that  are  presented 
to  it;  it  does  not  choose  the  best  sample  possible  within  the  design  space.  Additionally,  the 
algorithm  assumes  that  the  current  surrogate  model  is  fairly  accurate  with  respect  to  the 
behavior  of  the  response.  If  the  surrogate  model  is  poor,  the  candidates  will  not  be 
evaluated  accurately,  and  it  is  unlikely  that  the  best  candidate  will  be  selected. 

Once  the  new  set  of  inputs  has  been  selected,  it  can  be  analyzed  to  determine  the  true 
response  values.  Once  the  response  values  are  available,  an  updated  surrogate  model 
should  be  trained  and  evaluated  -  essentially,  returning  to  Step  Two  of  this  guide.  The 
process  continues  until  the  surrogate  models  are  deemed  acceptable  or  the  available 
resources  run  out,  as  described  in  Section  A.6.4. 

The  strategy  of  analyzing  each  new  sample  as  soon  as  it  is  selected  may  not  be  the  most 
efficient  one.  This  is  especially  true  if  some  or  all  of  the  analysis  (including  both  the  setup 
and  the  analysis)  can  be  done  in  parallel.  In  such  a  scenario,  it  may  be  more  efficient  to 
select  a  batch  of  samples  and  analyze  them  all  at  once  before  updating  the  surrogate  model. 
Selecting multiple samples introduces the risk that the second sample in a batch might be
very  close  to  the  first,  reducing  its  usefulness.  To  avoid  this,  each  sample  is  added  to  the 
training  data  pool  (using  the  response  values  estimated  by  the  current  surrogate  models) 
before  subsequent  samples  are  selected.  This  has  the  effect  of  reducing  the  prediction 
uncertainty  around  the  “new”  sample,  diminishing  the  incentive  to  place  later  samples  in 
the  same  region  of  the  design  space. 

Because  estimated  response  values  are  used  rather  than  true  response  values,  the  later  sets  of 
input  values  in  the  batch  will  be  chosen  based  on  information  that  is  not  entirely  up-to-date. 
This  may  lead  to  sub-optimal  selections  in  some  cases.  Still,  if  the  analysis  time  is 
significant  and  multiple  samples  can  be  analyzed  in  parallel,  the  overall  execution  time  may 
be  substantially  reduced  by  this  approach. 

To  select  multiple  cases  without  executing  the  analysis,  the  user  should  run  the  sample 
selection  algorithm  repeatedly,  adding  the  selected  sample  to  the  model  after  each  round. 
The  augmented  data  set  is  then  used  when  selecting  the  next  sample.  By  taking  this 
approach, subsequent samples in each batch will naturally spread out, with little risk of
clumping or clustering.
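The batch procedure can be sketched as a loop. In the example below the selector and the surrogate are simple hypothetical stand-ins (a farthest-point rule and linear interpolation on a one-dimensional toy problem) so that the loop runs on its own; in practice the guide's sample-selection algorithm and kriging predictor would take their place:

```matlab
X = [0.0; 1.0];                 % current training inputs (1-D toy problem)
Y = [0.2; 0.8];                 % their analyzed (true) responses
candidates = (0:0.125:1)';      % fixed candidate set
batch_size = 3;

batch = zeros(batch_size, 1);
Xaug = X;                       % training inputs plus provisional samples
Yaug = Y;                       % true responses plus surrogate estimates
for b = 1:batch_size
    % Stand-in selector: the candidate farthest from every point so far.
    d = min(abs(bsxfun(@minus, candidates, Xaug')), [], 2);
    [~, idx] = max(d);
    x_new = candidates(idx);

    % Stand-in surrogate: append its ESTIMATE, not a true analysis result.
    y_est = interp1(X, Y, x_new);
    Xaug = [Xaug; x_new];       %#ok<AGROW>
    Yaug = [Yaug; y_est];       %#ok<AGROW>
    batch(b) = x_new;
end

% 'batch' can now be analyzed in parallel. The last batch_size entries of
% Yaug are placeholders to be replaced by true responses before retraining.
disp(batch');
```

Because each provisional sample is added before the next selection, later picks are pushed away from earlier ones, which is the behavior described above.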

After  a  batch  of  samples  is  selected,  they  can  then  be  analyzed  all  at  once,  taking  advantage 
of  any  opportunities  to  perform  the  analyses  in  parallel.  This  approach  can  significantly 
reduce  the  total  time  to  select  and  analyze  a  given  number  of  samples,  while  reducing  the 
negative  consequences  of  not  analyzing  each  sample  immediately. 
Appendix B: Geometric Parameter Ranges and Default
Values

A  full  description  of  the  geometric  parameters  which  define  the  outer  shape  of  a  reusable 
booster  vehicle  may  be  found  in  the  work  of  Garmendia  et  al.[61]  The  details  should  not 
affect  the  validity  of  the  methods  described  herein,  and  thus  were  not  reproduced  within  this 
document. 

Two parameters, Fuselage Radius Fraction and Wing Root Chord Fraction, were
independent variables in every experiment and thus default values were not assigned for
these parameters. Additionally, some of the experiments featured 9 independent
parameters rather than 2. For these experiments, the variables Top Curvature 1, Top
Curvature 2, Nose Droop, Nose Fineness Ratio, Wing Half-Span Fraction, Wing Airfoil
Camber, and Area Ratio of Vertical Tail to Wing were allowed to vary alongside Fuselage
Radius Fraction and Wing Root Chord Fraction. These variables are marked
with the symbol †.

When selecting default values for the parameters that would be inactive in the smaller-scale
experiments, the objective was to identify values that would produce consistent
and/or interesting response behavior for the purposes of the experiments. In some cases,
the default value selected was outside the nominal range for that parameter. This was
considered to be acceptable because these default values were used as part of smaller
experiments but never as part of the larger scale experiments, and thus the pre-existing
space-filling data sets which were constrained to those limits were still usable as the null
hypothesis for the larger scale efforts. Parameters for which defaults lie outside the nominal
range are marked with the symbol ‡.

Parameter                                  | Default | Min.  | Max.  | Units
Scale                                      | 50      | 40    | 100   | Feet
LoftStart                                  | 0.6     | 0.40  | 0.95  | Fraction of Scale
LoftEnd                                    | 0.15    | 0.05  | 0.20  | Fraction of Scale
Fuselage Radius Fraction                   | -       | 0.05  | 0.25  | Fraction of Scale
Flare Factor                               | 1.07    | 1.00  | 1.15  | Fraction of Fuselage Radius
Top Curvature 1 †                          | 0.1     | 0     | 1     | Unitless
Top Curvature 2 †                          | 0.6     | 0     | 1     | Unitless
Bottom Curvature 1                         | 0.1     | 0     | 1     | Unitless
Bottom Curvature 2                         | 0.6     | 0     | 1     | Unitless
Nose Droop †                               | 0.57    | 0.5   | 1     | Fraction of Fuselage Height
Nose Spatularity                           | 0.15    | 0     | 1     | Unitless
Nose Fineness Ratio †                      | 1.7     | 1     | 3     | Unitless
Inboard Wing Sweep Angle ‡                 | 75      | 40    | 70    | Degrees
Outboard Wing Sweep Angle                  | 45      | 10    | 60    | Degrees
Wing Root Chord Fraction                   | -       | 0.3   | 0.7   | Fraction of Scale
Inboard Taper Ratio                        | 0.6     | 0.5   | 0.7   | Unitless
Outboard Taper Ratio                       | 0.35    | 0.25  | 0.95  | Unitless
Wing Half-Span Fraction †                  | 0.34    | 0.1   | 1.5   | Fraction of Scale
Wing Crank Location                        | 0.17    | 0.1   | 0.5   | Fraction of Wing Half-Span
Wing Tip Twist Angle                       | 0       | -5    | 0     | Degrees
Wing Incidence Angle                       | 0       | 0     | 3     | Degrees
Wing Dihedral Angle                        | 0       | 0     | 12    | Degrees
Wing Airfoil Camber †                      | 0.07    | 0     | 6     | Fraction of Local Chord (in 10% Increments)
Wing Airfoil Position of Maximum Camber    | 5       | 2     | 6     | Fraction of Local Chord (in 10% Increments)
Wing T/C Ratio                             | 7       | 3     | 8     | Fraction of Local Chord (in 10% Increments)
Wing Location of Max. Thickness            | 3       | 2     | 8     | Fraction of Local Chord (in 10% Increments)
Leading Edge Radius Factor                 | 6       | 2     | 8     | Unitless
Area Ratio of Vertical Tail to Wing †      | 0.4     | 0.1   | 0.5   | Unitless
Vertical Tail Cant Angle                   | 10      | 0     | 30    | Degrees
Vertical Tail Toe-In Angle                 | 5       | 0     | 10    | Degrees
Vertical Tail Leading Edge Sweep Angle     | 45      | 20    | 55    | Degrees
Vertical Tail Aspect Ratio                 | 1.25    | 0.5   | 2.0   | Unitless
Vertical Tail Taper Ratio ‡                | 0.4     | 0.5   | 0.8   | Unitless
Vertical Tail Airfoil Camber               | 0       | 0     | 6     | Fraction of Local Chord
Vertical Tail Position of Maximum Camber ‡ | 0       | 2     | 6     | Fraction of Local Chord (in 10% Increments)
Vertical Tail T/C Ratio                    | 7       | 3     | 8     | Unitless
Vertical Tail Location of Max. Thickness   | 3       | 2     | 8     | Fraction of Local Chord (in 10% Increments)
Inboard Elevon Depth                       | 30      | 10    | 40    | Unitless
Outboard Elevon Depth                      | 30      | 10    | 40    | Unitless
Rudder Depth                               | 30      | 10    | 40    | Unitless
Body Flap Size                             | 0.10    | 0.08  | 0.015 | Fraction of Scale
Inboard Elevon Deflection (Starboard)      | 0       | -30   | 30    | Degrees
Outboard Elevon Deflection (Starboard)     | 0       | -30   | 30    | Degrees
Inboard Elevon Deflection (Port)           | 0       | -30   | 30    | Degrees
Outboard Elevon Deflection (Port)          | 0       | -30   | 30    | Degrees
Rudder Deflection (Starboard)              | 0       | -30   | 30    | Degrees
Rudder Deflection (Port)                   | 0       | -30   | 30    | Degrees
Body Flap Deflection                       | 0       | -20   | 30    | Degrees
Wing Longitudinal Position                 | -0.1    | -0.15 | 0     | Fraction of Scale
Appendix C: Additional Results from “Probability of
Interest” Studies

This appendix presents the results from the experiment in Section 6.9 in greater depth. This
experiment featured 2 free parameters, the Fuselage Radius Fraction and the Wing Root Chord
Fraction, and 3 responses. The responses were the pitching moment coefficient of the vehicle
at three flight conditions: Mach 0.3, α = 15°, β = 0°; Mach 0.8, α = 0°, β = 0°; and Mach 2.5, α = 0°, β = 0°.
Default settings for the remaining geometric parameters are recorded in Appendix B.

The  contour-based  sampling  algorithm,  which  drew  only  on  results  from  Cart3D  for  this 
experiment,  was  tasked  with  reducing  prediction  variance  for  cases  likely  to  have  pitching 
moment  coefficients  within  ±0.1  at  every  flight  condition.  A  50  x  50  grid  of  cases  spanning 
the  design  space  was  analyzed  with  Cart3D.  Based  on  the  results  of  those  analyses,  the  cases 
which met the pitching moment criterion at each flight condition can be seen in Figure C-1a, C-1b &
C-1c. The cases which met the pitching moment criterion at every flight condition are shown in
Figure C-1d. These cases will be henceforth referred to as the “cases of interest.”


Figure  C-1:  Cases  of  Interest  at  All  Flight  Conditions 
The analysis results were interpolated so that sampling experiments could be conducted without
having to analyze each selected case as it was requested. This allowed experiments to be
conducted quite rapidly, although the time required for the contour-based sampling algorithm to
select the next sample was not insignificant. The algorithm used a 23x23 grid of candidates and
a 40x40 grid of test points. The same candidates and test points were used in every round of
sampling.

The  contour-based  sampling  algorithm  for  one  response  as  described  by  Picheny  et  al.[149]  was 
extended  to  identify  samples  that  would  improve  prediction  confidence  for  multiple  responses 
at  once.  When  this  extended  algorithm  was  applied  to  the  two-input,  three-response  problem 
described  above,  it  was  found  that  the  algorithm  would  select  samples  that  improved  prediction 
confidence  for  any  response,  even  if  the  selected  sample  did  not  benefit  every  response.  The 
samples  selected,  and  the  resulting  prediction  accuracy  for  the  cases  of  interest,  may  be  seen  in 
Figure  64. 


Figure 64: Sampling For Required POI of 0% (a) & 95% Prediction Error Quantiles for Cases of Interest
at (b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5


The algorithm was modified to emphasize regions that were likely to contain cases of interest
(i.e., cases for which all responses lie within the range of interest) based on the expected
Probability of Interest, or POI. This quantity captures the likelihood that every response for a
particular sample would fall within the specified range(s) of interest. See Section 6.9.1 for
more information on this topic. After modification, the algorithm would not evaluate candidates
which had POI values lower than the user-defined requirement. If no candidates met the
requirement, the algorithm would select the candidate with the highest POI value.
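The screening rule described above can be sketched as follows. The POI values and the two thresholds are hypothetical, and the variable names are illustrative rather than taken from the scripts used in this work:

```matlab
% Hypothetical POI estimates for six candidates.
poi = [0.002; 0.030; 0.150; 0.008; 0.060; 0.000];

thresholds = [0.05, 0.25];      % two hypothetical minimum POI requirements
picks = cell(1, numel(thresholds));
for i = 1:numel(thresholds)
    % Candidates that meet the requirement go on to be evaluated.
    eligible = find(poi >= thresholds(i));
    if isempty(eligible)
        % No candidate meets the requirement: fall back to the highest POI.
        [~, eligible] = max(poi);
    end
    picks{i} = eligible;
end
disp(picks{1}');
disp(picks{2}');
```

With the lower threshold, two candidates remain for evaluation; with the higher one, none qualify and the single highest-POI candidate is returned instead.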

A  minimum  POI  requirement  of  1%  -  that  is,  only  candidates  with  better  than  a  1  percent 
likelihood  of  falling  in  the  region  of  interest  for  all  three  responses  -  produced  the  sample 
distribution  seen  in  Figure  65a.  The  samples  exhibited  significantly  more  clustering  in  the 
region  of  interest,  which  was  the  desired  effect.  Additionally,  the  95  percent  error  quantiles 
converged  more  quickly  than  in  the  previous  case. 
Figure 65: Sampling For Required POI of 1% (a) & 95% Prediction Error Quantiles for Cases of Interest
at (b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5


Increasing  the  required  POI  value  to  5  percent  produced  the  results  shown  in  Figure  66. 
Importantly,  in  this  scenario  the  algorithm  did  not  identify  the  region  of  interest  as  quickly  as 
before;  the  third  sample  for  1  percent  POI  was  in  or  near  the  region  of  interest,  while  at  5 
percent  POI  the  region  was  not  identified  until  the  fifth  sample.  The  prediction  accuracy 
results for cases of interest also showed convergence behavior that was delayed relative to the 1
percent  case,  although  it  was  still  an  improvement  compared  to  the  0  percent  case. 
Figure 66: Sampling For Required POI of 5% (a) & 95% Prediction Error Quantiles for Cases of Interest
at (b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5

Figure  67  shows  the  effects  of  a  10  percent  POI  requirement.  The  samples  were  even  more 
tightly  clustered  than  in  previous  scenarios.  Of  particular  importance  was  the  order  of  samples: 
note  that  the  first  5  samples  gradually  progressed  down  the  left  edge  of  the  space.  The  POI 
requirement  restricted  the  candidates  which  the  algorithm  could  consider,  allowing  only  those 
with  higher  likelihoods  to  be  sampled. 
Figure 67: Sampling For Required POI of 10% (a) & 95% Prediction Error Quantiles for Cases of Interest
at (b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5

It  should  be  noted  that  a  low  POI  value  did  not  necessarily  indicate  that  a  given  candidate 
was  expected  to  be  far  from  the  region  of  interest.  The  POI  was  estimated  using  the 
cumulative  distribution  function  of  a  normal  distribution;  even  if  the  predicted  response  was 
exactly  centered  in  the  region  of  interest,  the  POI  might  still  be  low  if  the  prediction  uncertainty 
was  large  relative  to  the  integration  limits.  Responses  that  were  poorly  represented  by  a  linear 
trend  model  would  have  high  estimated  process  variance,  and  in  turn  would  have  high  estimated 
prediction uncertainty. See Section 6.11.1 for more details on this topic.
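A short numerical illustration may help here. The sketch below evaluates the single-response probability for a prediction centered in the ±0.1 range used in this study, for three hypothetical values of the prediction standard deviation; erf is used for the normal CDF:

```matlab
Phi = @(z) 0.5 * (1 + erf(z / sqrt(2)));     % standard normal CDF
y_lower = -0.1;
y_upper =  0.1;                              % range of interest from this study
y_hat   =  0;                                % prediction centered in the range

% Single-response probability of interest for a given standard deviation.
poi = @(s) Phi((y_upper - y_hat) ./ s) - Phi((y_lower - y_hat) ./ s);

s = [0.02, 0.1, 0.5];                        % hypothetical standard deviations
p = poi(s);
fprintf('sigma = %.2f  ->  POI = %.3f\n', [s; p]);
```

For the largest standard deviation the single-response value falls to roughly 0.16 even though the prediction is perfectly centered. If three such responses were treated as independent (an assumption made only for this illustration; Section 6.9.1 defines the actual combination), the joint value would be about 0.4 percent, below even a 1 percent requirement.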

With  respect  to  prediction  accuracy,  the  10  percent  POI  results  showed  slightly  delayed 
convergence relative to the previous scenarios, although the convergence for the Mach 0.3
response was significantly smoother than for other scenarios.

A 15 percent POI requirement produced the results shown in Figure 68. The effect of POI
requirements on sample selection was even more obvious here than it was for the case of 10
percent POI. Because the prediction uncertainty was small in the region around each original
sample (represented by triangular icons), the algorithm selected cases close to those samples.
This process repeated as the samples marched through the space in search of desirable response
behavior. After the seventh sample, there was enough confidence in the estimated model
behavior to make the jump to the region of interest, which was correctly identified. Subsequent
samples remained close to this region for the most part.


Figure 68: Sampling For Required POI of 15% (a) & 95% Prediction Error Quantiles for Cases of Interest
at (b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5


Once  again  the  prediction  error  quantiles  converged  more  slowly  than  in  previous  scenarios. 
The  final  95  percent  quantiles  for  the  Mach  0.8  response  were  actually  slightly  wider  than  for 
the  10  percent  scenario,  indicating  that  after  15  samples  the  model  produced  by  the  15 
percent  POI  requirement  was  slightly  less  accurate  for  that  response. 


Finally,  the  experiment  was  repeated  with  a  POI  requirement  of  25  percent.  The  effects  of  a  high 
POI  value  are  clearly  visible  in  the  results,  displayed  in  Figure  69.  The  sample  progression 
showed  tight  clustering  with  occasional  jumps;  although  the  left  side  of  the  region  of  interest  was 
sampled  by  the  eighth  sample,  the  extent  of  that  region  was  not  discovered  until  the  thirteenth 
sample. 
Figure 69: Sampling For Required POI of 25% (a) & 95% Prediction Error Quantiles for Cases of Interest at
(b) Mach 0.3, (c) Mach 0.8, and (d) Mach 2.5


The slow exploration process that resulted from the high POI requirement is evident in the
convergence history of prediction error for each response, as seen in Figure 69b, 69c & 69d. After
15 adaptive samples, contour-based sampling out-performed space-filling sampling only for
the Mach 2.5 response.


The  results  in  this  Appendix  should  serve  to  demonstrate  the  effect  of  the  POI  requirement. 
Experiments  revealed  that  even  an  ostensibly  reasonable  POI  requirement  such  as  15  percent  could 
handicap  the  sampling  algorithm  compared  to  a  more  lenient  setting.  This  was  due  to  the  way 
that  POI  was  calculated;  candidates  which  are  expected  to  have  desirable  response  values  might 
exhibit  low  POI  values  if  the  prediction  variance  was  large.  Thus,  particularly  in  the  early 
stages  of  sampling,  it  may  be  best  to  use  a  relatively  low  POI  requirement  to  allow  the  sampling 
algorithm  to  explore  and  gain  a  better  understanding  of  response  behavior.  This  requirement 
could  be  raised  later  on  in  the  sampling  process  when  there  was  more  confidence  that  each 
response  can  be  approximated  with  reasonable  accuracy. 
Appendix D: Description of Scripts for Operation of Cart3D
and Computing Resources

This appendix will provide details about the way that Cart3D and the High Performance
Computing systems were run.

D.1 Preprocessing

This research effort focused on the aerodynamics of reusable booster vehicles during the
unpowered return-to-launch-site phase of operation, and in particular the static moments. The
aerodynamics were calculated based on the flight condition and the outer geometric shape, or
Outer Mold Line (OML), of the vehicle. This shape, including control surface deflections, could
be defined via a set of parameters using the PaceLab vehicle definition tool. [61, 142] For the
purposes of this research, the vehicle definition tool was only used to generate a triangular surface
mesh, which could then be analyzed using Cart3D, and a text file which identified the hinge axis
of each control surface.

D.2  Generation  of  Surface  Meshes 

The PaceLab tool generated the wing and fuselage as a halfbody, as well as each control surface in
its deflected state. The wing and half-fuselage were converted from .STL format to .TRI format
using the stl2tri.pl and off2tri.pl utilities (packaged with Cart3D) and ADMesh, a utility which
processes solid triangular meshes. [2] The two TRI files were then intersected using Cart3D’s
intersect utility to produce a single, watertight triangular surface mesh for half the vehicle.
Almost invariably, any failures at this stage were found to be problems with the STL files. Most
commonly, the STL file would not describe a closed surface (commonly referred to as
“watertight”) and when ADMesh was run, that program would attempt to create a closed surface
if possible. The resulting triangulation could be of poor quality - i.e., very long thin triangles -
and still might not produce a closed surface.

The cubes utility, which generates a volumetric grid around the shape, was used to test whether the
triangulation for the wing & half-fuselage was of sufficient quality for Cart3D analysis. This
was done by generating a volumetric mesh of roughly the resolution that would be used in
later analyses, with approximately 2-3 million cells. Any deficiency in the surface triangulation
would produce an error message and the triangulation would be considered to have failed. If the
triangulation passed this test, it would then be mirrored about its centerline to create a full fuselage
and wing using a Matlab-based utility created by Jonathan Sharma. The seven control surfaces
would then be intersected with the fuselage and wing, and cubes would once again be used to test
for gaps or other deficiencies in the mesh.

When a particular case failed to build properly, the root of the problem was sometimes to be
found in the wing control surfaces. The PaceLab tool “cuts” control surfaces from the fixed
wing using the nearest available surface nodes; a wing mesh with higher resolution was more
likely to result in a viable TRI file, but this also produced relatively large files. The Matlab
script which drove the geometry generation process was therefore written to start with a
relatively low wing mesh resolution: ncs, the number of chordwise cross-sections between the
wing root and the wing tip, was usually set to 130, and pcs, the number of vertices per
cross-section, was often set to 140. If any cases failed to build, these two parameters were
increased by 20 and the process was repeated. If ncs exceeded 290, any cases which had not been
built successfully were abandoned.
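The retry logic described above can be sketched as a loop. The build step below is a hypothetical stand-in (a case "builds" once ncs reaches a made-up threshold) so that the control flow can be run on its own; the actual script called the PaceLab tool and the Cart3D utilities at this point:

```matlab
% HYPOTHETICAL stand-in for the geometry build: one made-up threshold per case.
required_ncs = [130, 170, 310];
build_ok = @(c, ncs, pcs) ncs >= required_ncs(c);

ncs = 130;                                   % starting values from the text
pcs = 140;
pending  = 1:numel(required_ncs);            % cases not yet built
built_at = nan(size(required_ncs));          % ncs at which each case built

while ~isempty(pending) && ncs <= 290
    still_failing = [];
    for c = pending
        if build_ok(c, ncs, pcs)
            built_at(c) = ncs;
        else
            still_failing(end+1) = c;        %#ok<AGROW>
        end
    end
    pending = still_failing;
    if ~isempty(pending)
        ncs = ncs + 20;                      % refine the wing mesh and
        pcs = pcs + 20;                      % retry only the failed cases
    end
end
abandoned = pending;                         % cases given up once ncs > 290
disp(built_at);
disp(abandoned);
```

Only the cases that failed are rebuilt at each finer resolution, which keeps the file sizes for the successful cases as small as possible.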


D.3  Defining  Cart3D  and  HPC  Settings 

Aside from the surface mesh, a number of other files had to be generated before each analysis
could be done. A number of scripts were created which automated the generation of these other files.

A centralized data file was created which encapsulated all of the data required to set up a case.

This master data file included:

•  Batch name and case number;

•  Flight condition ID number;

•  Mach number, angle of attack, and sideslip angle;

•  Orientation of vehicle within local coordinate frame;

•  Reference area, mean aerodynamic chord, and wing half-span for normalization of forces
& moments into coefficients;

•  Center of mass position in the local coordinate frame, based on mass-estimating
relationships developed by Lockheed Martin Space Systems;

•  Number of nodes on the HPCC system to be requested for each case;

•  Length of time these nodes should be reserved;

•  Cart3D baseline CFL setting; and

•  Definition of control surface hinge lines, for the calculation of hinge moments.

Most of these items are straightforward, but some may deserve further explanation.

The batch name and case number uniquely identified each case and connected it to the input
settings used to generate the surface mesh.

The flight condition ID number was a two-digit value which corresponded to particular values
of Mach number, angle of attack, and sideslip angle.

The local orientation of the vehicle told Cart3D how the vehicle was oriented with respect to its
internal numerical conventions. The surface mesh generated by PaceLab had its origin at the
lower corner of the fuselage rear surface, where the engine assembly or assemblies would
attach, and the nose extended in the negative-X direction. The right wing of the vehicle was in
the positive-Y direction, and the upper surface of the vehicle was in the positive-Z direction. The
native orientation of Cart3D is to have the nose of the vehicle be the most positive X-coordinate,
and the top of the vehicle be in the negative-Z direction, so any deviation from that orientation
must be noted or else the condition simulated may not be the one intended. In Cart3D terms, the
PaceLab orientation is described as (-Xb, Yb, -Zb).

The reference area & lengths were calculated using the geometric planform of the wing, which was
composed of two trapezoidal panels. These values were expressed in millimeters since that was the
length scale used by PaceLab when exporting the surface mesh.

The CFL setting for Cart3D acts as a sort of pseudo-“time step” similar to the forward-Euler
method for initial value problems. [70] Cart3D is a static rather than dynamic analysis so this
is something of a false analogy, but the flow solver arrives at a solution by iterating until the
flow behavior throughout the space has converged to some steady-state solution. The CFL value
determines how much the flow can change between iterations. [37] The larger the CFL value, the
more the state of the simulation will change after each iteration. Smaller values correspond to a
smaller change per iteration, which slows convergence to a solution but reduces sensitivity to
numerical instabilities, such as may occur in transonic or highly-separated flow. The default
value in Cart3D is 1.1 and the minimum value used in Cart3D’s adaptive grid refinement script
aero.csh is 0.2. If convergence problems are identified during an analysis, the minimum value is
used. CFL numbers used in this effort occasionally ranged as high as 1.3 if the problem was
expected to be well-behaved numerically.

Once  created,  the  master  job  description  file  would  be  uploaded  to  a  High  Performance 
Computing  system,  along  with  the  surface  triangulations.  Additional  scripts  would  then  be  used 
to  set  up  each  case  to  be  run. 

D.4  Setting  Up  Cases  on  HPC  Systems 

An  assortment  of  files  was  used  to  define  and  run  each  Cart3D  analysis.  Perl  scripts  were  written 
to  customize  these  files  to  the  case  being  run  and  the  desired  Cart3D  behavior. 

D.4.1  config.xml 

This  file,  expressed  in  XML,  gives  names  to  each  of  the  numbered  components  in  the  surface 
mesh.  The  component  numbers  are  derived  from  the  order  they  are  listed  in  the  intersect  command 
during  geometry  generation.  Although  not  strictly  necessary  for  use  of  Cart3D,  the 
configuration  name  in  this  file  is  set  using  the  batch  name,  case  number,  and  flight  condition 
ID  from  the  master  job  description  file. 

D.4.2 input.cntl

This  is  the  primary  file  which  controls  Cart3D.  Information  from  the  master  job  description  file 
is  used  to  set  values  for  the  Mach  number,  alpha  (angle  of  attack)  and  beta  (sideslip  angle).  A 
CFL  number  is  defined  here  using  the  specified  baseline  value,  although  this  value  may  be 
overridden  by  a  similar  entry  in  aero.csh.  The  orientation  of  the  vehicle  is  defined  in  this  file  as 
well,  along  with  the  reference  area  and  one  reference  length.  Only  one  reference  length  can  be 
specified  in  this  file,  so  the  mean  aerodynamic  chord  is  commonly  used  for  normalization  of  the 
pitching  moment.  The  reference  length  for  lateral  moments  will  come  into  play  in  the  CLiC  files, 
described  below. 

The  objective  function  for  the  adaptive  gridding  logic  of  aero.csh  is  also  specified  in  this  file. 
Although  the  objective  function  can  incorporate  any  combination  of  the  aerodynamic  forces, 
it  can  only  include  one  of  the  aerodynamic  moments  at  a  time.  For  these  efforts,  the  pitching 
moment coefficient and drag coefficient were given weights of 1, while the lift coefficient was
given  a  weight  of  0.5.  For  flight  conditions  with  nonzero  sideslip  angles,  the  side  force 
coefficient  was  also  given  a  weight  of  0.5.  No  rigorous  evaluation  of  the  effect  of  these  values 
on Cart3D behavior was undertaken.

This file is created based on information from the master job description file during the job
creation process using the script batch_adjoint_multiple.pl, which in turn calls the script
c3d_input_maker.pl.
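
The objective-function weights described in this section can be summarized as a small lookup. The Python sketch below is illustrative only: the function name and the dictionary keys are hypothetical labels, and it does not reproduce the input.cntl syntax or the Perl code that was actually used.

```python
def objective_weights(beta_deg):
    """Return the weight given to each coefficient in the adaptation objective.

    The keys are hypothetical labels, not Cart3D keywords.
    """
    # Drag and pitching moment were weighted 1; lift was weighted 0.5.
    weights = {"drag": 1.0, "pitching_moment": 1.0, "lift": 0.5}
    # The side force was only weighted for conditions with nonzero sideslip.
    if beta_deg != 0.0:
        weights["side_force"] = 0.5
    return weights
```

Only one moment (here the pitching moment) appears in the weights, consistent with the single-moment restriction noted above.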

D.4.3 clic.cntl

CLiC  is  a  post-processing  module  which  takes  as  input  an  annotated  triangulation  of  a  body 
or vehicle (a TRIQ file) and produces force and moment coefficients for the entire configuration
and/or for individual components. [31] The module also requires an input script, named clic.cntl by
default, which defines the relevant parameters. These parameters include the local vehicle
orientation,  the  angle  of  attack,  the  sideslip  angle,  the  reference  area  and  reference  length,  the 
location  of  the  center  of  mass,  and  (if  desired)  the  component  number  and  reference  axis  for 
component-level  moments.  Such  component-level  moments  can  be  used  to  calculate  the  hinge 
moments  necessary  to  deflect  the  control  surfaces  to  the  specified  positions. 

Two  sets  of  CLiC  input  files  are  generated  for  each  analysis  using  Perl  scripts.  One  set  uses  the 
mean  aerodynamic  chord  as  the  reference  length  for  accurate  calculation  of  the  pitching 
moment coefficient, while the other set uses the half-span for calculation of the rolling and
yawing moment coefficients. Each set also includes four center-of-mass (COM) locations. The
first location is the one calculated by the mass estimating relationships, and serves as the best
guess of the true COM location of the vehicle. The other three locations are derived by nudging
the COM one meter along each Cartesian axis. The resulting changes in the aerodynamic
moments can be used to determine the effect of shifting the COM of the vehicle, as might result
from different distributions of internal components such as batteries or RCS propellant tanks. The change in
aerodynamic  moments  due  to  a  change  in  the  COM  of  the  vehicle  can  also  be  calculated  using 
the  angle  of  attack,  sideslip  angle,  reference  lengths,  and  aerodynamic  force  coefficients;  data 
from  the  extra  CLiC  results  serve  as  a  confirmation  and  double-check  of  the  analytically-derived 
results. 
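
The analytic check mentioned above follows from the rigid-body moment transfer relation: moving the reference point by Δr changes the moment by −Δr × F. Dividing by the dynamic pressure, reference area, and reference length gives the change in each moment coefficient. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the force coefficients are assumed to be expressed in the same body-fixed axes as the COM shift, and a separate reference length is supplied for each axis. It is not the author's Perl implementation.

```python
def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def perturbed_coms(com, step=1.0):
    """Return the baseline COM followed by nudges of `step` along x, y and z."""
    x, y, z = com
    return [(x, y, z), (x + step, y, z), (x, y + step, z), (x, y, z + step)]


def moment_coefficient_change(delta_com, force_coeffs, ref_lengths):
    """Change in the moment coefficients about x, y and z for a COM shift.

    delta_com    -- COM displacement (same length units as ref_lengths)
    force_coeffs -- force coefficients along x, y and z (assumed body axes)
    ref_lengths  -- reference length used to normalize each moment component
    """
    shift = cross(delta_com, force_coeffs)
    # Moment transfer: M_new = M_old - delta_r x F, normalized per axis.
    return tuple(-shift[i] / ref_lengths[i] for i in range(3))
```

Comparing these values with the differences between the perturbed and baseline CLiC results gives the kind of double-check described above.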

These CLiC input files are generated during the job creation process that is carried out by the
Perl script batch_adjoint_multiple.pl, which in turn calls the scripts clic_input_maker.pl,
clic_input_maker_perturbx.pl, clic_input_maker_perturby.pl, and
clic_input_maker_perturbz.pl. The resulting input files instruct CLiC to calculate the baseline
forces and moments plus the effects of a 1-meter COM change in the x-, y-, and z-directions,
respectively.

D.4.4  aero.csh 

This script was included as part of the Cart3D distribution. It executed both the adaptive
gridding logic and the flow solver, cycling between the two to refine the volumetric mesh in the
regions which most affected the solution, as calculated using the weighting function from the
input.cntl file. Most parameters were left unchanged from the default settings, aside from the
CFL value (specified in the master job description file) and the spanwise orientation flag
(changed to indicate that the Y dimension corresponded to the wingspan of the vehicle).

Additionally,  the  mesh  growth  rates  would  be  tweaked  for  each  flight  condition.  These  rates 
would be adjusted, following the trend in the default aero.csh file of larger rates in later rounds,
until the script produced grids with roughly two or three million cells, as evaluated with one
or two dozen configurations. Typically, supersonic flight conditions required higher growth
rates  to  produce  that  number  of  cells;  it  is  believed  that  the  shock  waves  present  at  those 
conditions  resulted  in  a  more  rapid  convergence  of  surface  pressure  distribution. 

It  is  quite  possible  that  volumetric  grids  with  fewer  cells  would  produce  results  that  were  equally 
accurate  at  those  flight  conditions.  Testing  this  possibility  would  require  a  series  of  mesh 
sensitivity  studies,  repeated  for  different  portions  of  the  design  space  and  different  flight 
conditions.  Given  the  high  availability  of  computing  resources,  it  was  concluded  that  those 
studies  were  unlikely  to  identify  sufficient  potential  savings  to  justify  their  execution. 

Perl  scripts  were  written  to  allow  the  settings  in  aero.csh  to  be  adjusted  based  on  the  master  job 
description  file.  The  job  creation  script  batch_adjoint_multiple.pl  would  first  call  the  Perl 
utility aeroCSH_maker.pl, which would write the initial section of aero.csh that contained the
user-controlled  settings.  This  initial  section  would  then  be  joined  with  the  remainder  of  the 
default  aero.csh  file  which  was  not  intended  to  be  modified  by  the  user. 

D.4.5  setBoxRunAero.pl 

This Perl script was created as a wrapper for the actual command to run Cart3D (./aero.csh).
Although most failed cases resulted from improper geometry triangulations, occasionally a case
would fail even though the triangulation was watertight. Watertightness was tested during the
building phase, as described in Section D.2. A review of the Cart3D User’s Group identified
other users who had occasionally encountered this problem. One user suggested that a
perturbation of the outer mesh boundary settings might address the problem. [168] Another user
indicated that Cart3D would sometimes behave poorly if the volumetric domain boundary
aligned exactly with the edge of the configuration being analyzed. [178] In light of these
suggestions, it was hypothesized that, if a case were found to fail during Cart3D analysis, it might
run successfully if the outer volumetric boundary were shifted.

The Cart3D utility autoinputs was used to define the outer volumetric boundary. This utility
initializes the volumetric mesh at a certain distance, which is proportional to the maximum
Cartesian dimension of an axis-aligned bounding box that encloses the object being
analyzed. [124] In short, the representative length L is the maximum extent of the object in any
one Cartesian direction, and the volumetric mesh is defined as a cube with sides of length (n × L),
where n is a user-controlled scaling parameter. Cart3D documentation indicates that for subsonic
flow, the value of n should be roughly 20-30; this value can be reduced for supersonic flow. [5]
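
The sizing rule described above can be illustrated with a short calculation. The Python sketch below does not call autoinputs; it only reproduces the arithmetic as stated in the text (cube side equal to n times the largest bounding-box extent). The function name and the vertex-list input are assumptions made for illustration.

```python
def domain_side_length(vertices, n=24):
    """Side length of the cubic domain: n times the largest bounding-box extent.

    vertices -- iterable of (x, y, z) surface points
    n        -- user-controlled scaling parameter (box size)
    """
    extents = []
    for axis in range(3):
        coords = [v[axis] for v in vertices]
        extents.append(max(coords) - min(coords))
    # L is the maximum extent in any one Cartesian direction.
    return n * max(extents)
```

For a vehicle whose longest dimension is 10 m, a box size of 24 would give a domain 240 m on a side under this reading of the rule.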

Initially,  setBoxRunAero.pl  runs  autoinputs  with  an  n  value,  or  box  size,  of  24.  If  the  case  does 
not  run  to  completion  -  i.e.,  it  does  not  generate  a  file  named  “entire.dat”  in  the  “adapt08” 
folder  -  the  setBoxRunAero.pl  script  will  increment  the  box  size  by  1  and  re-run  autoinputs  and 
aero.csh.  If  the  box  size  exceeds  30,  attempts  to  run  the  case  are  aborted  and  an  empty  “entire.dat” 
file  is  generated  in  the  adapt08  directory  to  avoid  any  future  attempts  to  re-analyze  the  case. 

When  setBoxRunAero.pl  is  run,  it  first  searches  for  “entire.dat”  in  the  folder  “adapt08”.  The 
existence  of  this  file  would  indicate  that  the  current  analysis  has  already  been  completed,  and 
thus  Cart3D  does  not  need  to  be  run  again.  If  the  “entire.dat”  file  does  not  exist  but  “adapt##” 
folders are present, it is assumed that a previous run ended unsuccessfully. This is also true if a
“STOP” file is found in the working directory. If the case did not complete successfully but
“adapt##” folders or a “STOP” file is present, the setBoxRunAero.pl script will purge all
“adapt##” folders and the “STOP” file and re-run Cart3D.
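
The retry logic described in the two preceding paragraphs can be sketched as follows. This is a minimal Python illustration, not the original Perl script. The function name, the status strings, and the run_solver callback (a stand-in for running autoinputs followed by ./aero.csh) are all hypothetical.

```python
import shutil
from pathlib import Path


def run_case(case_dir, run_solver, first_box=24, last_box=30):
    """Run one case, enlarging the box size after each failed attempt.

    run_solver(case_dir, box_size) is a caller-supplied stand-in for
    running autoinputs and ./aero.csh. Returns a status string.
    """
    case_dir = Path(case_dir)
    result = case_dir / "adapt08" / "entire.dat"
    if result.exists():
        # A result (or an empty placeholder) means the case is finished.
        return "already_done"
    for box_size in range(first_box, last_box + 1):
        # Purge leftovers from any earlier, unsuccessful attempt.
        for leftover in list(case_dir.glob("adapt[0-9][0-9]")):
            if leftover.is_dir():
                shutil.rmtree(leftover)
        (case_dir / "STOP").unlink(missing_ok=True)
        run_solver(case_dir, box_size)
        if result.exists():
            return "completed"
    # Every box size failed: leave an empty file so the case is not retried.
    result.parent.mkdir(parents=True, exist_ok=True)
    result.touch()
    return "abandoned"
```

Passing the solver as a callback keeps the control flow testable without a Cart3D installation.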

Consultation with HPC systems help desks led to the modification of this script: the “lfs
setstripe” command was used to change the number of object storage targets (OSTs) that would
be used to write the files to disk. The default value is 6, but due to the large number of
(relatively) small files generated by Cart3D, the help desks requested that the OST settings be
changed so that only 1 OST was used for each directory to mitigate system loads.
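
The text does not record the exact options that were passed to the command. As a hedged illustration, the Python sketch below builds an lfs setstripe invocation using the -c (stripe count) option of the Lustre client; the function name is hypothetical, and the command is only constructed, not executed.

```python
def stripe_command(directory, stripe_count=1):
    """Build (but do not run) the command that sets a directory's stripe count."""
    # -c sets the number of OSTs over which new files in the directory are striped.
    return ["lfs", "setstripe", "-c", str(stripe_count), str(directory)]
```

On a Lustre file system the resulting list could be passed to subprocess.run before any case files are written to the directory.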

D.4.6  PBS_maker.pl 

This PBS file creation script would be called by the job creation script
batch_adjoint_multiple.pl. It would draw upon information in the master job description file to
create  the  PBS  job  scripts  that  would  be  submitted  to  the  queue.  These  job  scripts  included  PBS 
commands  that  controlled  the  number  of  processors  requested,  the  duration  of  the  request,  and 
the  HPC  account  which  should  be  charged  for  the  resources  used. 

Additionally,  the  scripts  contained  the  commands  which  would  be  executed  when  the  job  was 
run.  Typically  these  commands  included  changing  the  current  directory  to  that  of  a  case  to  be 
analyzed,  running  setBoxRunAero.pl  to  analyze  the  case  if  necessary,  and  running  the 
aero_archive.csh  script  after  completing  a  job  to  clean  up  unnecessary  files.  Often  a  single  PBS 
job  file  would  include  10-25  cases.  This  simplified  the  bookkeeping  of  job  resources,  as  it 
reduced  the  overall  number  of  jobs  submitted  to  the  queue  and  simplified  the  workload  of  the 
queue  control  software. 
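
A job script of the kind described above might be assembled as in the following Python sketch. The directive forms (-N, -A, -l select, -l walltime) follow common PBS Professional usage, but resource-request syntax varies between sites and the directives actually used are not given in the text. The function name, the shell, and the assumption that setBoxRunAero.pl and aero_archive.csh reside in each case directory are illustrative.

```python
def make_pbs_script(job_name, account, ncpus, walltime, case_dirs):
    """Return the text of a PBS job script that analyzes several cases in turn."""
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",
        f"#PBS -A {account}",               # account charged for the resources
        f"#PBS -l select=1:ncpus={ncpus}",  # processors requested
        f"#PBS -l walltime={walltime}",     # duration of the request
        "",
    ]
    for case_dir in case_dirs:
        lines += [
            f"cd {case_dir}",
            "./setBoxRunAero.pl",   # analyze the case if necessary
            "./aero_archive.csh",   # clean up unnecessary files
            "",
        ]
    return "\n".join(lines)
```

Bundling 10-25 case directories into one call produces a single job file of the kind described above.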

D.4.7  submit.pl 

This  script  was  fairly  simple  compared  to  the  others.  It  would  check  the  current  directory  for 
any  PBS  job  files  and  submit  them  sequentially  with  a  10-20  second  delay  between  submissions 
so  as  not  to  overload  the  queue  software.  A  subdirectory,  “submittedPBS,”  would  also  be  created 
if  it  did  not  exist;  after  each  PBS  job  was  submitted,  the  job  file  was  moved  to  “submittedPBS” 
so  that  it  would  not  be  re-submitted  if  the  script  were  called  again. 
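
The submission loop can be sketched in a few lines of Python. This is an illustration rather than the original Perl: the function name is hypothetical, the ".pbs" file extension is an assumption (the text does not state how job files were named), and the submit callback stands in for the actual queue submission command.

```python
import shutil
import time
from pathlib import Path


def submit_all(directory, submit, delay_seconds=15, sleep=time.sleep):
    """Submit every PBS job file in `directory`, then move it out of the way.

    submit(path) is a caller-supplied stand-in for the queue submission call.
    Returns the names of the files that were submitted.
    """
    directory = Path(directory)
    done = directory / "submittedPBS"
    done.mkdir(exist_ok=True)  # create the subdirectory if it does not exist
    submitted = []
    for job_file in sorted(directory.glob("*.pbs")):
        submit(job_file)
        # Moving the file prevents re-submission if the script is run again.
        shutil.move(str(job_file), str(done / job_file.name))
        submitted.append(job_file.name)
        sleep(delay_seconds)  # pause so as not to overload the queue software
    return submitted
```

Calling the function a second time on the same directory submits nothing, because every job file has already been moved.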

D.4.8  refreshQueue.pl 

This  script  was  used  primarily  on  the  U.S.  Army  Engineer  Research  &  Development  Center  (ERDC) 
system called Diamond. That system offered high throughput of Cart3D cases compared to
other  systems,  and  it  was  difficult  to  manage  the  queue  to  maximize  the  number  of  analyses 
completed  without  negatively  impacting  the  experience  of  other  users  of  the  system.  Virtually 
unlimited  jobs  could  be  submitted  to  the  queue  at  a  time,  but  too  many  jobs  would  make  it 
difficult  for  other  users  to  find  their  jobs  within  the  queue.  Conversely,  if  too  few  jobs  were 
submitted  at  a  time,  they  might  all  finish  before  the  next  time  the  author  checked  the  queue, 
reflecting  lost  chances  to  complete  more  analyses. 

Rather  than  checking  the  status  of  the  queue  every  few  hours,  this  script  was  written  to  automate 
the  process.  When  run,  the  script  would  query  the  number  of  jobs  waiting  in  the  queue  that 
belonged  to  the  author.  That  number  was  compared  against  a  minimum  limit  defined  in  the  script 
(typically 20). If the author had fewer than 20 jobs in the queue waiting to start, the script
would query the number of running jobs that belonged to the author. If this number were less than
a cutoff (e.g., 50), the script would submit a new PBS file to the queue.

If a new PBS file was to be submitted, the script would move to a folder which contained PBS
scripts to be submitted and select one to submit. Because many jobs could be running
simultaneously,  each  of  which  might  call  refreshQueue.pl  at  around  the  same  time,  the  script 
could  not  simply  select  the  first  file  in  that  folder.  Such  a  strategy  resulted  in  the  same  job  being 
submitted  multiple  times  before  it  could  be  moved  to  a  different  folder.  Instead,  one  of  the  first  50 
jobs  in  that  folder  was  selected  at  random  for  submission.  The  refreshQueue.pl  script  would  then 
pause  for  a  random  duration  between  45  and  115  seconds  before  re-checking  the  status  of  the 
queue. 
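
The decision rule and the randomized selection described above can be sketched as follows. The Python function names and arguments are hypothetical; in particular, the queued and running job counts are passed in directly rather than obtained from the queue software, which is how the original script obtained them.

```python
import random


def pick_job_to_submit(n_queued, n_running, pending_jobs, rng=random,
                       min_queued=20, max_running=50, pool_size=50):
    """Return the job file to submit next, or None if none should be submitted."""
    if n_queued >= min_queued or n_running >= max_running or not pending_jobs:
        return None
    # Choosing at random among the first `pool_size` files makes it unlikely
    # that two simultaneous callers pick the same job file.
    return rng.choice(pending_jobs[:pool_size])


def pause_seconds(rng=random):
    """Random wait, in seconds, before the queue status is checked again."""
    return rng.uniform(45, 115)
```

The random pause serves the same purpose as the random selection: it spreads out concurrent copies of the script so that they rarely act at the same moment.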

This  script  drastically  reduced  the  user  effort  required  to  ensure  that  an  appropriate  number  of 
jobs  were  in  the  queue  on  Diamond  at  all  times.  The  user  needed  only  to  log  in  sporadically  to 
collect  finished  results  and  to  add  more  PBS  job  files  to  the  appropriate  folder. 

D.4.9  batch_adjoint_multiple.pl 

This  script  performed  all  the  tasks  necessary  to  set  up  a  set  of  cases  for  analysis.  It  would  read 
the  master  job  description  file  and  call  other  scripts  to  generate  the  various  files  required  by 
Cart3D. A directory for the case at hand would then be created in the scratch space, and all
necessary  files  moved  to  that  directory.  This  script  would  also  run  the  script  which  created  the 
PBS  job  file  to  run  the  analysis. 

D.4.10  pullCase.pl 

This  script  was  used  to  collect  finished  cases.  The  user  defined  a  batch  name,  a  range  of  case 
ID  values,  and  a  flight  condition  number.  All  cases  which  matched  the  description  would  then  be 
moved  to  a  separate  folder  in  the  scratch  space  corresponding  to  that  particular  batch  and  flight 
condition.  Once  this  move  was  completed,  the  collator_adj.pl  script  was  copied  to  the  new  folder 
and  run  in  order  to  parse  the  output  of  the  analyses. 

D.4.11  collator_adj.pl 

This script parsed the Cart3D output files for useful data and recorded it in a convenient set of
files.  In  each  case  directory,  it  used  multiple  files  to  accumulate  the  desired  data  sets. 

First, the script would read the “input.cntl” file and record the Mach number, angle of attack,
sideslip angle and reference area. Secondly, it would read the “clic_lat.cntl” and “clic_lon.cntl”
files and record the lateral and longitudinal reference lengths, respectively.

Next,  it  would  open  the  “entire.dat”  file,  which  contains  the  iteration  history  of  the  forces  and 
moments  on  the  vehicle,  and  parse  the  final  30  lines  to  evaluate  convergence.  The  average  and 
standard deviation of each response were calculated. Because the forces were in the body-aligned
frame, those forces were transformed using the angle of attack and sideslip angle to
determine the lift, drag and side forces.
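
The two calculations in this paragraph can be sketched in Python as shown below. Several assumptions are made that the text does not confirm: the population standard deviation is used, the body-frame coefficients are taken as axial (positive aft), side (positive to the right), and normal (positive up), and the standard flight-mechanics definitions of angle of attack and sideslip are applied. Cart3D's own axis and sign conventions may differ, and the function names are hypothetical.

```python
import math
import statistics


def tail_stats(history, n_tail=30):
    """Mean and population standard deviation of the last n_tail values."""
    tail = history[-n_tail:]
    return statistics.mean(tail), statistics.pstdev(tail)


def body_to_wind(c_axial, c_side, c_normal, alpha_deg, beta_deg):
    """Convert body-frame force coefficients to (lift, drag, side force)."""
    a = math.radians(alpha_deg)
    b = math.radians(beta_deg)
    # Drag is the force component along the freestream direction.
    c_drag = (c_axial * math.cos(a) * math.cos(b)
              - c_side * math.sin(b)
              + c_normal * math.sin(a) * math.cos(b))
    # Lift is perpendicular to the freestream, in the vehicle's plane of symmetry.
    c_lift = c_normal * math.cos(a) - c_axial * math.sin(a)
    # The wind-axis side force completes the right-handed triad.
    c_side_wind = (c_axial * math.cos(a) * math.sin(b)
                   + c_side * math.cos(b)
                   + c_normal * math.sin(a) * math.sin(b))
    return c_lift, c_drag, c_side_wind
```

At zero sideslip the expressions reduce to the familiar pair: lift equals normal force times cos α minus axial force times sin α, and drag equals axial force times cos α plus normal force times sin α.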

The  next  step  was  to  read  the  various  CLiC  output  files.  Those  files  detailed  the  center  of 
mass that was used for the moment calculations, as well as the final force and moment values.
Additionally, those files included the calculated hinge moment acting on each control surface
about  its  hinge  line.  Using  the  difference  between  the  whole-body  moments  about  the 
perturbed  and  unperturbed  center  of  mass,  the  change  in  moment  with  respect  to  changing 
center  of  mass  could  be  calculated.  This  quantity  could  be  determined  analytically  as  well;  these 
calculations  were  used  primarily  as  a  sanity  check. 

The  results  were  written  to  a  set  of  output  files.  One  file  contained  the  forces  and  moments  in 
the form of lift, drag, side force, roll, pitch, etc., as well as the convergence data and the
effects  of  COM  changes  on  the  aerodynamic  moments.  Another  file  contained  the  same  results 
but  in  axial-normal-lateral  format.  A  third  file  contained  the  hinge  moments  for  each  case.  The 
fourth  file  contained  only  the  convergence  results.  These  files  were  then  renamed  according  to  the 
current  batch  name  and  copied  to  the  user’s  home  directory  for  easy  collection. 

D.5  Post-Processing 

Once  the  results  had  been  downloaded  from  the  High  Performance  Computing  systems,  they 
were  processed  to  link  each  set  of  results  to  the  corresponding  input  values  and  filter  out  any  cases 
which  did  not  run  correctly.  Because  the  goal  was  to  model  multiple  flight  conditions,  if  a  case  did 
not  run  correctly  for  every  flight  condition,  it  was  discarded  from  all  data  sets. 
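
The filtering rule in the last sentence amounts to a set intersection over the flight conditions. The Python sketch below illustrates it; the function name and the dictionary layout (flight condition label mapped to the case IDs that ran correctly) are assumptions.

```python
def cases_valid_everywhere(valid_ids_by_condition):
    """Return the case IDs that ran correctly at every flight condition."""
    id_sets = [set(ids) for ids in valid_ids_by_condition.values()]
    # A case survives only if it appears in the valid set of every condition.
    return set.intersection(*id_sets) if id_sets else set()
```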

The  parsing  script  would  read  in  the  various  output  files  that  had  been  created  by 
collator_adj.pl  and  the  input  deck  that  had  been  used  to  generate  the  cases.  A  line  was  read  from 
each  data  file  and  parsed  for  case  ID;  if  this  ID  was  not  the  same  for  every  data  file,  the  script 
would  throw  an  error  and  halt  operation. 
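
The lockstep consistency check can be sketched as follows. The layout of the data files is not given in the text, so this Python illustration assumes that the case ID is the first whitespace-delimited field on each line; the function name is hypothetical.

```python
def matched_rows(*files):
    """Yield (case_id, rows), reading one line from each data file at a time.

    Each argument is an iterable of text lines. A ValueError is raised as
    soon as the files disagree about the case ID of the current line.
    """
    for rows in zip(*files):
        ids = {row.split()[0] for row in rows}
        if len(ids) != 1:
            raise ValueError(f"case ID mismatch across data files: {sorted(ids)}")
        yield ids.pop(), rows
```

Because zip stops at the shortest input, a file with missing trailing lines would need to be detected separately in this sketch.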

If  all  ID  numbers  matched,  the  corresponding  case  was  read  from  the  input  deck,  and  the  current 
data  line  in  each  file  would  be  fully  parsed  for  details  such  as  flight  condition,  reference  scale 
values,  and  force  &  moment  results.  The  convergence  data  -  including  average  response  value, 
standard  deviation  of  the  response,  and  the  ratio  of  the  standard  deviation  to  the  average  -  was 
also  parsed. 

An unexpected quirk of Cart3D was the fact that the iteration history for the aerodynamic
moments was calculated as if the COM were at (0, 0, 0) rather than the specified value. The
post-processing script corrected the iteration data to match the specified COM location, using the
reference length, the average force values, the angle of attack, and the sideslip angle.

A  number  of  “goodness”  checks  were  then  performed  to  ensure  that  no  nonsensical  results  were 
included  in  the  final  data  set: 

• if any drag coefficient was less than or equal to zero, the case would be rejected;

• if the absolute value of the lift coefficient was greater than 20, this was taken as an
indication that Cart3D had converged to a nonsensical answer, and the case would be
rejected;

•  if  the  absolute  value  of  the  pitching  moment  coefficient  was  greater  than  50,  the  case 
was  considered  to  be  nonsensical  and  rejected;  and 

• if the standard deviation of the pitching moment coefficient Cm and the ratio of the
standard deviation of Cm to the average value of Cm were both larger than 0.05, the case was
considered to be insufficiently converged, and was rejected.

This final requirement deserves more attention. Either criterion alone would not be sufficient: a
case could have a large standard deviation and still be converged if the response value was large
with respect to the standard deviation. Conversely, a case might have a large ratio
of standard deviation to average and still be converged if the average value were close to zero.
By  combining  the  two  criteria,  the  only  rejected  cases  would  be  those  which  exhibited  both 
significant  absolute  variability  (i.e.,  large  standard  deviation)  and  significant  relative  variability 
(i.e.,  large  ratio  of  standard  deviation  to  average). 
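
The four checks listed above, including the combined convergence criterion, can be expressed compactly. The Python sketch below is illustrative: the function and argument names are hypothetical, it evaluates a single flight condition, and it takes the ratio against the absolute value of the average Cm, which the text does not state explicitly.

```python
def passes_goodness_checks(cd, cl, cm_mean, cm_std, limit=0.05):
    """Return True if a result passes all four checks, False if it is rejected."""
    if cd <= 0.0:              # non-positive drag is nonsensical
        return False
    if abs(cl) > 20.0:         # implausibly large lift coefficient
        return False
    if abs(cm_mean) > 50.0:    # implausibly large pitching moment coefficient
        return False
    # Relative variability; treated as infinite when the average is exactly zero.
    relative = cm_std / abs(cm_mean) if cm_mean != 0.0 else float("inf")
    # Reject only when BOTH absolute and relative variability are large.
    if cm_std > limit and relative > limit:
        return False
    return True
```

A case with an average Cm of 10 and a standard deviation of 0.1 is accepted (the ratio is 0.01), as is a case with an average of 0.001 and a standard deviation of 0.01 (the standard deviation is below 0.05); a case with an average of 0.5 and a standard deviation of 0.1 fails both tests and is rejected.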

Cases  which  were  not  rejected  were  included  in  the  combined  output  data  set.  This  data  set 
combined  the  input  settings  for  each  case  with  the  aerodynamic  responses  at  each  flight 
condition for ease of reference.

List of Acronyms

AFRL      Air Force Research Laboratory
AGARD     Advisory Group for Aerospace Research and Development
AoA       angle of attack
APAS      Aerodynamic Preliminary Analysis System
ASDL      Aerospace Systems Design Laboratory
CBS       contour-based sampling
CDF       cumulative distribution function
CEASIOM   Computerized Environment for Aircraft Synthesis and Integrated Optimisation Methods
CFD       computational fluid dynamics
COM       center of mass
DoD       Department of Defense
ERDC      U.S. Army Engineer Research & Development Center
GA        genetic algorithm
HPCC      high performance computing cluster
IMSE      integrated mean squared error
MFE       model fit error
MRE       model representation error
NASA      National Aeronautics and Space Administration
NLHC      nested Latin hyper-cube
OML       outer mold line
PDF       probability density function
POI       probability of interest
RANS      Reynolds-Averaged Navier-Stokes
RBS       reusable booster stage
RMSE      root mean squared error
RSM       response surface methodology
RTLS      return to launch site
S/HABP    Supersonic Hypersonic Arbitrary Body Program
UDP       Unified Distributed Panel
wIMSE     weighted Integrated Mean Squared Error

List of Symbols

α             angle of attack
β_n           nth coefficient
β             coefficient vector
c(x)          covariance vector
C             covariance matrix
δ             stochastic residual
[illegible]   random measurement error
ε             error
F             experimental matrix
μ             mean
φ             probability density function
s_n²(x)       sample variance
σ²            variance
Φ             cumulative distribution function
x             sample point
y             response
z             standard normal distribution